Skip to content

Add Azure Linux 4.0 support#698

Open
vinceaperri wants to merge 7 commits into
mainfrom
user/vinceaperri/azl4-support
Open

Add Azure Linux 4.0 support#698
vinceaperri wants to merge 7 commits into
mainfrom
user/vinceaperri/azl4-support

Conversation

@vinceaperri
Copy link
Copy Markdown
Contributor

@vinceaperri vinceaperri commented Apr 14, 2026

  • Download and test azl4 test image in functional and vm test workflows (core-efi only, legacy option unavailable)

  • Add azl4 to distribution-support table (all yes except --package-snapshot-time and os.packages.shapshotTime: the package snapshot time APIs require tdnf but azl4 uses dnf)

  • Add azl4 to the developer guide (no link because it hasn't been released yet)

  • Add azl4 to the README as a supported input image

  • Add azl4 vm tests (equivalents of the existing azl3 vm tests)

    • test_min_change_efi_azl4_iso_full_os_output
    • test_min_change_efi_azl4_iso_bootstrap_output
    • test_min_change_efi_azl4_qcow_output
    • test_create_image_efi_qcow_output_azl4
  • Add azl4 to:

    • azureLinuxDistroHandler.GetTargetOs()
    • newAzureLinuxDistroHandler()z
    • NewDistroHandlerFromTargetOs()
    • getFileSystemOptions()
    • GetInstalledTargetOs()
    • TestCreateImageRaw()
    • TestCreateImageBtrfs()
    • TestCreateImageRawNoTar()
    • TestCreateImage_OutputImageFileAsRelativePath()
    • TestValidateCreateImageOutput_AcceptsValidPaths()
    • TestValidateCreateImageConfig_EmptyPackagestoInstall()
    • TestBootCustomizer*() family of tests
  • Create baseImageAzureLinuxCoreEfiAll var containing both azl3 and azl4 core-efi images, to migrate all tests that only test testBaseImageAzl3CoreEfi to test baseImageAzureLinuxCoreEfiAll as well (enabling additional tests)

    • Notes:
      • TestCustomizeImagePackagesUrlSource is always skipped for azl4, since cloud-native-azl4.repo but hasn't yet been added yet, as the repository does not yet exist for azl4.
      • TestCustomizeImagePackagesSnapshotTime, TestCustomizeImagePackagesCliSnapshotTimeOverridesConfigFile, and TestCustomizeImagePackagesSnapshotTimeWithoutPreviewFlagFails are explicitly skipped for azl4, since it doesn't currently support this feature. The feature (added to azl3's tdnf by Microsoft but not upstreamed) would need to be ported to azl4's dnf before we could add support.
  • Add azl4 core-efi image to baseImageAzureLinuxAll (enables a ton of tests that currently test azl3)

  • Split functional tests for AZL in Build (dev) workflow into 4 and 3+2 to enable azl4 testing in main builds

  • No longer skip TestCustomizeImageOverlaysSELinux/AzureLinux3CoreEfi

  • Review notes:

    • New configs were created for azl4 only when changes were backwards-incompatible with azl3 (e.g. kernel version differences)
    • skip_if_unavailable=False is set when refreshing azl4 package metadata to match tdnf behavior in azl3, which errors if repo isn't available (this is explicitly tested for in TestCustomizeImagePackagesBadRepo)
    • We check for systemd-boot-unsigned instead of systemd-boot in validateUkiDependencies() for azl4 because it hasn't been released yet (when released, will converge on using systemd-boot)
    • Switching (again) from unzip to wget in tests since azl4 has unzip installed
  • Follow up tasks:

    • Add package snapshot time support to AZL4's dnf (enables package snapshot time feature)
    • Add cloud-native-azl4.repo when it becomes available (enables additional test coverage)
    • Remove systemd-boot-unsigned for azl4 when systemd-boot becomes available (enables secure boot)
    • Upstream ukify patch (see Patch ukify (AZL3) step in the pipelines)
    • Enable TestCustomizeImageMultiKernel for AZL4 when its repos contain multiple kernels (add the newer kernel to multikernel-azl4.yaml)
    • Remove the workaround to remove grub2-efi-x64/aa64 on AZL4 after upgrade to shim 16.1+ (Fedora 44+)
    • Onboard a bare-metal image for azl4 (when available)
    • Add support for ISO and PXE-* outputs

@vinceaperri vinceaperri requested a review from a team as a code owner April 14, 2026 16:26
Comment thread toolkit/tools/pkg/imagecustomizerlib/baseconfigs_test.go
Comment thread toolkit/tools/pkg/imagecustomizerlib/customizeoverlays_test.go Outdated
Comment thread toolkit/tools/pkg/imagecustomizerlib/customizepackages_test.go Outdated
@liulanze
Copy link
Copy Markdown
Contributor

Please add a PR description for future reference.

Comment thread test/vmtests/vmtests/imagecustomizer/test_min_change.py
Comment thread docs/imagecustomizer/developer-guide.md
@vinceaperri vinceaperri force-pushed the user/vinceaperri/azl4-support branch from f0c4779 to 4c26a66 Compare April 15, 2026 22:59
Comment thread toolkit/tools/pkg/imagecustomizerlib/artifactsinputoutput_test.go Outdated
Comment thread toolkit/tools/pkg/imagecustomizerlib/main_test.go Outdated
Comment thread toolkit/tools/pkg/imagecustomizerlib/main_test.go
Comment thread toolkit/tools/pkg/imagecustomizerlib/distrohandler_fedora.go Outdated
Comment thread toolkit/tools/pkg/imagecustomizerlib/distrohandler_azurelinux.go Outdated
Comment thread toolkit/tools/imagegen/diskutils/filesystem.go
Comment thread toolkit/tools/imagegen/diskutils/filesystem.go Outdated
Comment thread toolkit/tools/pkg/imagecustomizerlib/customizeoverlays_test.go Outdated
Comment thread toolkit/tools/pkg/imagecustomizerlib/distrohandler_fedora.go
Comment thread toolkit/tools/pkg/imagecustomizerlib/distrohandler_fedora.go Outdated
Comment thread toolkit/tools/pkg/imagecustomizerlib/distrohandler_acl.go Outdated
Comment thread toolkit/tools/pkg/imagecustomizerlib/customizeuki.go

options := strings.TrimSpace(value)
var filteredTokens []string
for _, token := range strings.Fields(options) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐻‍❄️ Karhu deep-review (🔴 Blocking, High confidence — flagged by all 3 skeptic models)

updateBLSEntryOptions mishandles whitespace/quoting — read/write asymmetry corrupts BLS round-trips. Three independent issues:

  1. Quoting: strings.Fields(options) splits on whitespace only. The reader (readKernelCmdlinesFromBLSEntries) uses grub.TokenizeConfig which handles quoting. A value like rd.cmdline="foo bar" gets split into two tokens on write.

  2. Tab/space detection (lines 303-305, 317-319): strings.Cut(trimmed, " ") then fallback \t. A line like "options\tfoo=bar" produces key="options\tfoo=bar" (no space match), the options line is silently skipped, and a new one is appended. Multiple options lines per spec means stale verity args coexist with new ones.

  3. Trailing newline: strings.Split/strings.Join on "\n" may add or drop the trailing newline depending on input. Most tools tolerate it; signed BLS entries (UKI+pcrphase scenarios when SB lands) won't.

Fix: unify on grub.TokenizeConfig / ParseCommandLineArgs for both read and write. Use strings.IndexFunc(trimmed, unicode.IsSpace) for line-key detection. Preserve trailing newline.

return fmt.Errorf("failed to read BLS entry file (%s):\n%w", absPath, err)
}

updatedContent, err := updateBLSEntryOptions(string(content), argsToRemove, newArgs)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐻‍❄️ Karhu deep-review (🔴 Blocking, Medium confidence)

Verity BLS update touches recovery/rescue entries. updateBLSEntriesForVerity iterates every .conf file (line 240) and calls updateBLSEntryOptions here with no recovery filter. Recovery entries get root=/dev/mapper/root and verity args injected. The recovery initramfs may not contain verity dracut modules, breaking the rescue path — defeating its purpose.

Recovery-skip logic exists in the read path (customizeuki.go:790) but not here.

Fix: apply the same recovery filter before mutation. Match both recovery and rescue, case-insensitively (the read-path filter is also too narrow — see separate comment at customizeuki.go:790).

}
rootDevice := argMap["root"]
delete(argMap, "root")
for name, value := range argMap {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐻‍❄️ Karhu deep-review (🟡 Non-blocking, Medium confidence)

Non-deterministic kernel arg ordering — reproducibility regression. argMap from ReadNonRecoveryKernelCmdlines is a map[string]string. Go map iteration is randomized per spec. Each osmodifier run produces a different key order in GRUB_CMDLINE_LINUX; grub2-mkconfig reproduces that into the resulting grub.cfg, making images non-reproducible across osmodifier invocations.

The pre-PR code iterated argTokens (deterministic grub.cfg order). Also: the previous len(lines) != 1 safety check that errored on multi-kernel images is gone — the new code silently picks whichever non-recovery entry lands first.

Fix: sort values alphabetically before passing to UpdateKernelCommandLineArgs, or change ReadNonRecoveryKernelCmdlines to return []kernelArg preserving source order.

Comment thread toolkit/tools/pkg/imagecustomizerlib/packagemanager_dnf.go Outdated
}

if title == "" {
return nil, fmt.Errorf("BLS entry (%s) is missing 'title' key", absPath)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐻‍❄️ Karhu deep-review (🟡 Non-blocking, Medium confidence)

title is now unconditionally required — regression vs. pre-refactor behavior. Before the unification in 3187f31d, the equivalent of extractKernelCmdlineFromBLSEntries(skipRecovery=false) only required title when skipRecovery=true. The new code rejects entries missing title for both code paths, even though the BLS spec allows it (systemd-boot falls back to filename).

Any AZL4 / Fedora image with a BLS entry missing title: fails the entire kernel-args read with no fallback. Affects extract / COSI / verity flows.

Fix: when title == "", treat the entry as non-recovery (don't error), or skip with a debug log.

Comment thread toolkit/tools/imagegen/installutils/installutils.go Outdated
return nil, fmt.Errorf("BLS entry (%s) is missing 'title' key", absPath)
}

if strings.Contains(title, "recovery") {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐻‍❄️ Karhu deep-review (🟡 Non-blocking, Medium confidence)

BLS recovery filter is substring-only and case-sensitive. strings.Contains(title, "recovery") misses Rescue, Recovery, distro-specific rescue naming. Combined with the missing recovery filter on the write path (see customizeverity.go:275), recovery entries can be both included in reads AND mutated in writes.

Fix: lowercase compare, match both recovery and rescue. Even better: extract a shared helper used by both the read path here and the verity update path.


return nil
return fmt.Errorf("package (%s) is not installed:\n"+
"this package must be installed to use Uki", systemdBootPackages[0])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐻‍❄️ Karhu deep-review (🟡 Non-blocking, Medium confidence)

Error message misnames the AZL4 package. Always reads "package (systemd-boot) is not installed" because it hardcodes systemdBootPackages[0]. On AZL4 the satisfying package is systemd-boot-unsigned; the error directs the operator to install the wrong one.

Fix:

return fmt.Errorf("package (%s) is not installed:\n"+
    "this package must be installed to use Uki", strings.Join(systemdBootPackages, " or "))

grubCfgContent, err := file.Read(grubCfgPath)
// readGrubConfigLinuxArgsBestEffort probes a boot directory for a kernel command-line source
// when no DistroHandler is available (e.g. os-release was not detectable on a USR-verity image).
func readGrubConfigLinuxArgsBestEffort(bootDir string) (map[string][]grubConfigLinuxArg, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐻‍❄️ Karhu deep-review (🟡 Non-blocking, Medium confidence)

readGrubConfigLinuxArgsBestEffort doesn't try BLS. When detectDistroFromRootfs fails (e.g. verity image without os-release on rootfs), distroHandler is nil and this best-effort path probes only grub2/grub.cfg / grub/grub.cfg. AZL4 verity images (BLS-only, no /boot/grub2/grub.cfg) get fs.ErrNotExist and surface a confusing error rather than attempting BLS extraction.

Also: hybrid BLS + inline-linux grub.cfg files (legitimate on Fedora during BLS migration windows) are hard-rejected at line 1138-ish. Could prefer BLS with a debug log.

Fix: also probe boot/loader/entries/*.conf if grub.cfg probes all miss. Tolerate hybrid grub.cfg by preferring BLS.

requiredRpms := []string{"systemd-boot"}
systemdBootPackages := []string{"systemd-boot"}
if distroHandler.GetTargetOs() == targetos.TargetOsAzureLinux4 {
systemdBootPackages = []string{"systemd-boot", "systemd-boot-unsigned"}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐻‍❄️ Karhu deep-review (🟡 Non-blocking, Low confidence)

UKI dependency check accepts systemd-boot-unsigned with no Secure Boot guard. systemdBootPackages = {"systemd-boot", "systemd-boot-unsigned"} for AZL4; first match wins. An image with only systemd-boot-unsigned (which will fail Secure Boot verification on a SB-enabled VM) silently passes UKI validation.

Acknowledged in the PR description as a workaround pending signed systemd-boot for AZL4, but there's no assertion that prevents unsigned from leaking to a Secure Boot configuration.

Fix: either warn when only unsigned is present, or hard-error in ValidateConfig if secureBoot is requested AND only systemd-boot-unsigned would be installed.

@Britel
Copy link
Copy Markdown
Contributor

Britel commented May 14, 2026

🐻‍❄️ Karhu deep-review (🔴 Blocking, Medium confidence — verified)

toolkit/tools/internal/resources/assets/azurelinux/azurelinux-4.0/efi/grub/grub.cfg is missing the {{.CryptoMountCommand}} placeholder → encrypted boot silently no-ops on AZL4.

Posting as a general comment because GitHub won't accept a line comment on this brand-new file.

The AZL3 stub at azurelinux-3.0/efi/grub/grub.cfg:1 starts with {{.CryptoMountCommand}}. The new AZL4 stub does not. setGrubCfgEncryptedVolume (installutils.go:1319) does a sed-style replacement of the placeholder; with no placeholder present, it matches zero occurrences and still returns nil.

Result: an AZL4 image build that enables encrypted root/boot has the encryption request silently dropped at this layer. Image boots, fails to find root, no build-time warning. Same bug reaches Fedora via the asset mismatch flagged at distrohandler_fedora.go:146.

Fix: pick one:

  • Add {{.CryptoMountCommand}} to this stub (matches AZL3 behavior)
  • Make setGrubCfgEncryptedVolume error when the placeholder is absent but enableEncryptedVolume=true
  • Gate AZL4 encrypted root behind an explicit ValidateConfig 'not yet supported' error

@Britel
Copy link
Copy Markdown
Contributor

Britel commented May 14, 2026

🐻‍❄️ Karhu deep-review — suggestions (architectural / non-merge-blocking)

These are direction notes from the architect role across all three models. Independent of the line-level findings — none of these block merge, but the first cluster is worth landing before AZL5 work begins.

Architecture (post-merge, before AZL5)

  1. Split azureLinuxDistroHandler per major version (azureLinux2DistroHandler, azureLinux3DistroHandler, azureLinux4DistroHandler). Currently 8-13 raw if d.version == "4.0" branches inside one struct (lines 31, 57, 88, 97, 115, 129, 153, 182, 187, 206, 225, 239 in distrohandler_azurelinux.go). AZL2 vs AZL3 vs AZL4 differ in initramfs tool, package manager, and boot-config format — three of the biggest axes the interface abstracts. ACL got its own file for less. Doing the split with three concrete versions in hand will produce a much better interface than waiting for AZL5 to force it.

  2. Pull orchestrator GetTargetOs() == AzureLinux4 checks into interface methods. customizeuki.go:351 duplicates handler knowledge from distrohandler_azurelinux.go:97. SSoT violation. Add e.g. RequiredSystemdBootPackages() to the interface and delete the orchestrator branch.

  3. Introduce a typed DistroCapabilities / DistroProfile struct to replace stringly-typed ValidateConfig rejections (e.g. ISO/PXE blocked via fmt.Errorf at distrohandler_azurelinux.go:60, ACL has 4 similar rejections at distrohandler_acl.go:39-53). Captures unsupported features as data, not prose — greppable and removable.

  4. Eliminate ReadKernelCmdlines copy-paste. Four byte-for-byte identical 5-line implementations across distrohandler_acl.go:3914, distrohandler_azurelinux.go:3994, distrohandler_fedora.go:4054, distrohandler_ubuntu.go:4097. Either remove from interface and call grubKernelArgsToStringMap(d.ReadGrubConfigLinuxArgs(...)) at callsites, or use an embedded base handler.

  5. Rename ReadGrubConfigLinuxArgsReadBootConfigLinuxArgs. Method name says "Grub" but AZL4 implementation reads BLS entries from /boot/loader/entries/*.conf. Confusing for future readers tracing the call.

  6. Standardize targetos version match policy. AZL exact-matches "4.0", ACL prefix-matches "3.*". An AZL 4.0.x patch line would fail detection; ACL 3.0.20260421 works. Either prefix-match both with a documented "major.minor pins the handler" contract, or exact-match both.

CI & Test architecture

  1. Matrix-ify build.yml. 6 hand-rolled {azl3, azl4} × {azl3-host, ubuntu2404-host} × {amd64, arm64} job blocks should be a strategy.matrix with an include list. Will double again at AZL5.

  2. Test YAML composition. verity-*-{amd64,arm64}-{azl3,azl4}.yaml proliferation (~16+ files for the verity matrix alone) will compound at AZL5. Either generate programmatically (Go struct → YAML) or use base + arch + distro overlay composition.

Workaround hygiene (cheap, high leverage)

  1. Annotate every AZL4 workaround with a removal trigger:

    • systemd-boot-unsigned package quirk
    • skip_if_unavailable=False for DNF
    • unzip → jq test substitution
    • ukify-fix-insertion-of-padding-in-merged-sections.patch CI patch
    • shim 16.1+ workaround for grub2-efi-x64/aa64
    • ISO/PXE blocked for AZL4 in ValidateConfig

    Format: // AZL4 WORKAROUND: <why>. Remove when <condition>. Tracking: <issue link>.

    A workaround you cannot find is a workaround you cannot remove.

Pre-merge polish

  • PR description says "switching from unzip to wget in tests" but the actual final state (after f29a0192) is unzip → jq. Description is stale.
  • mountFstabPartitionReadonly in partitionutils.go has a duplicated docstring (merge artifact).
  • The global GRUB_DISABLE_OS_PROBER=true change in installutils.go:CallGrubMkconfig silently changes behavior for ALL existing AZL2/AZL3/Fedora images. It's a correctness fix (host disks should never be in a guest grub.cfg), but worth calling out in release notes so operators don't file a regression when they notice different grub.cfg output.
  • Decide whether targetos.GetInstalledTargetOs should treat unknown VARIANT_ID on Azure Linux as standard AZL or fail closed (targetos.go:54-77).

Skipped tests with no "unskip when X" trigger

Not bugs, but worth tracking explicitly:

  • TestCustomizeImagePackagesAddOfflineLocalRepoWithGpgKey — "AZL4 RPMs not yet GPG signed"
  • TestCustomizeImageMultiKernel for AZL4 — "only one kernel version available"
  • SELinux removal test for AZL4 — "selinux-policy ships by default"
  • ARM64 UKI tests skipped with AZL3-specific message but condition is version-agnostic — should narrow to AZL3 only since AZL4 ships systemd-boot-unsigned on arm64 (e.g. baseconfigs_test.go:79).

Strengths everyone agreed on

  • DistroHandler interface expansion is the right seam.
  • The os-prober env fix is real correctness — host kernel-entry leakage was a latent bug affecting all images.
  • detectDistroFromRootfs cascading /usr and /usr/lib fallback is genuinely thoughtful for USR-verity images.
  • /run/_bindmountroot improvement over /mnt/_bindmountroot collision is a clean win, with good comments.
  • dnf vs tdnf abstraction in rpmPackageManagerHandler is clean.
  • Tests are not shallow — verifyVerityGrub rewritten to call distroHandler.ReadGrubConfigLinuxArgs instead of regex parsing eliminates a test-only parser.
  • 100-commit history is CI-driven calibration, not algorithmic churn.

Per-agent reports retained locally — happy to share specific findings if useful.

return nil, fmt.Errorf("failed to find os-release file:\n%w", err)
}

// mountFstabPartitionReadonly mounts the partition referenced by fstab's mountTarget entry under
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated docstring, see line 440 and 443


// FedoraGrubCfgRelPath is the path to the grub config file on Fedora and Azure Linux systems,
// relative to the /boot directory.
FedoraGrubCfgRelPath = "grub2/grub.cfg"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we re-use the constants under customizeuki?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants