feat(linux): add build support for GB200/300 image series#8521

Open
keith-ms wants to merge 87 commits into main from keithpimm/pull-gb-files-into-main


@keith-ms

This change pulls the commits specific to the release-gb200 branch into main, rather than trying to merge commits from main into release-gb200.

I've modified the vhdbuilder/packer/pre-install-dependencies.sh from the release-gb200 branch so that the NVIDIA kernel path from the PPA depends on the presence of the GB200 feature flag.

anson627 and others added 30 commits May 14, 2026 09:33
…E_LABELS are written to a /etc/default/kubelet
…els are written to the /etc/default/kubelet file
@keith-ms keith-ms changed the title feat: Add build support for GB200/300 image series feat(linux): Add build support for GB200/300 image series May 15, 2026
@cameronmeissner cameronmeissner changed the title feat(linux): Add build support for GB200/300 image series feat(linux): add build support for GB200/300 image series May 15, 2026
rm -f ${POD_INFRA_CONTAINER_IMAGE_TAR}
}

validateKubeletNodeLabels() {
Contributor

where is this used?

Contributor

It isn't currently used anywhere in the call path. I confirmed there are no references to validateKubeletNodeLabels in-tree while working on this update (30dedb6).

Author

This isn't used. I can remove it, but it probably should be integrated because a label over 63 characters will cause kubelet to fail to start.
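
For context on the 63-character limit mentioned above: Kubernetes caps label values at 63 characters. A minimal sketch of the length check such a function could perform, assuming the labels arrive as a comma-separated key=value string. The function name and constant are hypothetical, and this is not the validateKubeletNodeLabels body from the PR:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the PR's validateKubeletNodeLabels implementation.
# It checks only the 63-character limit on label values.

MAX_LABEL_VALUE_LENGTH=63

# Takes a comma-separated list of key=value pairs as $1.
# Returns 0 if every value fits, 1 otherwise; offending pairs go to stderr.
check_label_value_lengths() {
    local labels="$1"
    local status=0
    local pair value
    local -a pairs

    IFS=',' read -r -a pairs <<< "${labels}"
    for pair in "${pairs[@]}"; do
        # Everything after the first '=' is treated as the value.
        value="${pair#*=}"
        if [ "${#value}" -gt "${MAX_LABEL_VALUE_LENGTH}" ]; then
            echo "label value too long (${#value} > ${MAX_LABEL_VALUE_LENGTH}): ${pair}" >&2
            status=1
        fi
    done
    return "${status}"
}
```

Called before the labels are written out, a non-zero return could fail the provisioning step early instead of letting kubelet fail at startup.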

Comment thread vhdbuilder/packer/install-dependencies.sh Outdated

# One additional request from MAI: Disable the AKS node problem detector. When this file is present, the Azure AKS VM Extension assumes the NPD has been installed on the VHD and skips installing it at provision time.
mkdir -p /etc/node-problem-detector.d/
touch /etc/node-problem-detector.d/skip_vhd_npd
Contributor

Did you already confirm with the observability team that the AKS VM extension respects this skip file? I don't think we've moved NPD into the VHD yet.

cc @chmill-zz

Contributor

I removed the skip_vhd_npd marker creation from the GB200 flow so we no longer rely on that unverified VM extension behavior (30dedb6).

Author

I ingested this from the release-gb200 branch. During the testing I did, the expected version of node problem detector got installed. I'd recommend keeping this.

Contributor

The skip file for NPD is in place on the extension side; it just isn't implemented broadly on the AgentBaker side. So the behavior Keith tested is working as expected, which is fabulous news.

Author

@copilot Don't remove the NPD change. Please revert that change in your commit.

Comment thread packer.mk
Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/8b4cdf6e-94c7-4c74-b5e2-8983bd8d7c23

Co-authored-by: cameronmeissner <24923771+cameronmeissner@users.noreply.github.com>
Copilot AI and others added 2 commits May 15, 2026 18:48
Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/7402cb4e-9c9d-4f11-8a2f-e49e30d5a461

Co-authored-by: keith-ms <153014933+keith-ms@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/7402cb4e-9c9d-4f11-8a2f-e49e30d5a461

Co-authored-by: keith-ms <153014933+keith-ms@users.noreply.github.com>
Contributor

not gonna block on this, though I'd prefer we move these artifacts that are only uploaded to GB200/300 VHDs into a subfolder, maybe called graceblackwell or something

else
# However, for the 24.04 ARM images, we MUST have both -azure and -azure-nvidia kernels, so that we can run on either vanilla ARM64 hardware or GB200.
if [ $(dpkg --get-selections | grep -c "linux-image") -lt 2 ]; then
echo "ERROR: Ubuntu 24.04 ARM image is missing either the -azure or -azure-nvidia kernel, cannot continue!" && exit 1
Contributor

nit: personally I'd move this sort of thing to content tests, though I understand the rationale of leaving it here for now to speed up build times
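
If this does move to content tests, one possible shape is a check that looks for each kernel flavor by name rather than counting linux-image entries, since a count of 2 can also be satisfied by two -azure kernels. A rough sketch, assuming the package names end in -azure and -azure-nvidia as the code comment describes; the function name and the sample package names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical content-test style check. Reads `dpkg --get-selections` output
# on stdin and succeeds only if both kernel flavors are marked "install".

has_both_kernel_flavors() {
    local selections
    selections="$(cat)"

    # A linux-image package whose name ends in -azure-nvidia.
    grep -Eq '^linux-image[^[:space:]:]*-azure-nvidia(:[a-z0-9]+)?[[:space:]]+install$' \
        <<< "${selections}" || return 1

    # A linux-image package whose name ends in -azure (so -azure-nvidia does not count).
    grep -Eq '^linux-image[^[:space:]:]*-azure(:[a-z0-9]+)?[[:space:]]+install$' \
        <<< "${selections}" || return 1

    return 0
}
```

On a built image this would be invoked as `dpkg --get-selections | has_both_kernel_flavors`.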

@keith-ms keith-ms enabled auto-merge (squash) May 15, 2026 18:56
@@ -0,0 +1,46 @@
{
Contributor

Do you plan to maintain these versions manually, or would you like to auto-update them with Renovate? I assume manual, given the difficulty of obtaining quota to test after updates.

Author

The versions are pinned because the image requires very specific package versions. This isn't meant to be a broadly useful image, even though the gb200 in the name implies something generic.

echo "Generating non-GPU containerd config for GPU node due to VM tags"
echo "${CONTAINERD_CONFIG_NO_GPU_CONTENT}" | base64 -d > /etc/containerd/config.toml || exit $ERR_FILE_WATCH_TIMEOUT

if grep -q 'BinaryName = "/usr/bin/nvidia-container-runtime"' /etc/containerd/config.toml 2>/dev/null; then
Contributor

Should there be an explicit GB200/300 check here as well?

Author

This happens at runtime, not build time, and I don't believe there's a way to check this at that point. The feature flag isn't present then. You could perhaps check by SKU via IMDS, but at the risk of failure due to IMDS access problems.

Contributor

Fair. Could also consider adding a sentinel flag during build time, such as /etc/aks/gpu-config-baked.marker, which can be checked here. Just an optional suggestion
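
A sketch of how the sentinel idea could look, for illustration only. The marker path is the one suggested above; the function names and the ROOT_DIR override (used so the sketch can be exercised outside a real node) are hypothetical and not part of the PR:

```shell
#!/usr/bin/env bash
# Sketch only. Marker path taken from the review suggestion; everything else
# (function names, ROOT_DIR) is an assumption made for this example.

GPU_CONFIG_MARKER="${ROOT_DIR:-}/etc/aks/gpu-config-baked.marker"

# Build time: record that the GPU containerd config was baked into the image.
write_gpu_config_marker() {
    mkdir -p "$(dirname "${GPU_CONFIG_MARKER}")" && touch "${GPU_CONFIG_MARKER}"
}

# Run time: succeed only if the marker was written during the build.
gpu_config_was_baked() {
    [ -f "${GPU_CONFIG_MARKER}" ]
}
```

The runtime branch could then test `gpu_config_was_baked` alongside the existing grep on /etc/containerd/config.toml, which avoids any dependency on IMDS.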

Comment thread parts/linux/cloud-init/artifacts/ubuntu/doca.list
fi
else
# However, for the 24.04 ARM images, we MUST have both -azure and -azure-nvidia kernels, so that we can run on either vanilla ARM64 hardware or GB200.
if [ $(dpkg --get-selections | grep -c "linux-image") -lt 2 ]; then
Contributor

Would it be worth including grep -q "GB200" <<<"$FEATURE_FLAGS" for future safety, when there's no dual booting?
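
One way the suggested guard could be wrapped, as a sketch. FEATURE_FLAGS is the variable named in the comment above; has_feature_flag is a hypothetical helper name, and the match is a plain substring test like the grep in the suggestion:

```shell
#!/usr/bin/env bash
# Sketch of the suggested guard. has_feature_flag is a hypothetical helper;
# FEATURE_FLAGS is the build-time variable referenced in the review comment.

has_feature_flag() {
    local flag="$1"
    # Use the second argument if given, otherwise fall back to FEATURE_FLAGS.
    local flags="${2:-${FEATURE_FLAGS:-}}"
    # Fixed-string substring match, same idea as: grep -q "GB200" <<<"$FEATURE_FLAGS"
    grep -qF -- "${flag}" <<< "${flags}"
}

# Example gate around the dual-kernel check, left commented out so that
# sourcing this file has no side effects:
# if has_feature_flag "GB200"; then
#     ... require both the -azure and -azure-nvidia kernels ...
# fi
```

Because this is a substring match, a flag name that contains GB200 as part of a longer token would also match; splitting on the delimiter would be stricter if that ever matters.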

7 participants