feat(linux): add build support for GB200/300 image series #8521
keith-ms wants to merge 87 commits into
…the GB200 platform
…E_LABELS are written to a /etc/default/kubelet
…els are written to the /etc/default/kubelet file
    rm -f ${POD_INFRA_CONTAINER_IMAGE_TAR}
}

validateKubeletNodeLabels() {
where is this used?
It isn’t currently used anywhere in the call path. I confirmed there are no references to validateKubeletNodeLabels in-tree while working this update (30dedb6).
This isn't used. I can remove it, but it probably should be integrated because a label over 63 characters will cause kubelet to fail to start.
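For reference, the length check mentioned above could look roughly like the sketch below. This is not the body of validateKubeletNodeLabels from the PR; `check_node_labels` is a hypothetical name, the comma-separated `key=value` input format is an assumption, and only the 63-character limits on label names and values are checked, not the allowed character set.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: validate a comma-separated "key=value" label list.
# Only the 63-character limits on label names and values are enforced.

MAX_LABEL_LEN=63

check_node_labels() {
    local labels="$1" pair key value name
    local -a pairs
    [ -z "$labels" ] && return 0
    IFS=',' read -r -a pairs <<<"$labels"
    for pair in "${pairs[@]}"; do
        key="${pair%%=*}"
        value="${pair#*=}"
        # A pair without "=" has no value.
        [ "$pair" = "$key" ] && value=""
        # The name is the part of the key after an optional "prefix/".
        name="${key##*/}"
        if [ "${#name}" -gt "$MAX_LABEL_LEN" ]; then
            echo "label name too long (${#name} > ${MAX_LABEL_LEN}): ${key}" >&2
            return 1
        fi
        if [ "${#value}" -gt "$MAX_LABEL_LEN" ]; then
            echo "label value too long (${#value} > ${MAX_LABEL_LEN}): ${key}" >&2
            return 1
        fi
    done
    return 0
}
```

Calling something like this before the labels are written out would turn a kubelet start failure into an earlier, clearer error.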
# One additional request from MAI: Disable the AKS node problem detector. When this file is present, the Azure AKS VM Extension assumes the NPD has been installed on the VHD and skips installing it at provision time.
mkdir -p /etc/node-problem-detector.d/
touch /etc/node-problem-detector.d/skip_vhd_npd
did you confirm already with the observability team that this skip file is respected by AKS VM extension? I don't think we've moved NPD into the VHD yet
cc @chmill-zz
I removed the skip_vhd_npd marker creation from the GB200 flow so we no longer rely on that unverified VM extension behavior (30dedb6).
I ingested this from the release-gb200 branch. During the testing I did, the expected version of node problem detector got installed. I'd recommend keeping this.
the skip file for npd is in place on the extension side, just not implemented broadly on agentbaker side. So the behavior keith tested is working as expected which is fabulous news.
@copilot Don't remove the NPD change. Please revert that change in your commit.
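To make the marker contract discussed in this thread concrete, here is a minimal sketch of how a provision-time step might honor the skip file. The real AKS VM extension logic is not shown anywhere in this conversation, so `should_install_npd` and the overridable `NPD_SKIP_MARKER` variable are hypothetical; only the marker path comes from the diff.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a provision-time check. This is NOT the AKS VM
# extension's code; it only illustrates the "marker present => skip" contract.

NPD_SKIP_MARKER="${NPD_SKIP_MARKER:-/etc/node-problem-detector.d/skip_vhd_npd}"

# Returns 0 (install) when the marker is absent, 1 (skip) when it exists.
should_install_npd() {
    [ ! -e "$NPD_SKIP_MARKER" ]
}

if should_install_npd; then
    echo "skip marker absent: NPD would be installed at provision time"
else
    echo "skip marker present: NPD install is skipped"
fi
```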
Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/8b4cdf6e-94c7-4c74-b5e2-8983bd8d7c23 Co-authored-by: cameronmeissner <24923771+cameronmeissner@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/AgentBaker/sessions/7402cb4e-9c9d-4f11-8a2f-e49e30d5a461 Co-authored-by: keith-ms <153014933+keith-ms@users.noreply.github.com>
not gonna block on this, though I'd prefer we move these artifacts that are only uploaded to GB200/300 VHDs into a subfolder, maybe called graceblackwell or something
else
    # However, for the 24.04 ARM images, we MUST have both -azure and -azure-nvidia kernels, so that we can run on either vanilla ARM64 hardware or GB200.
    if [ $(dpkg --get-selections | grep -c "linux-image") -lt 2 ]; then
        echo "ERROR: Ubuntu 24.04 ARM image is missing either the -azure or -azure-nvidia kernel, cannot continue!" && exit 1
nit: personally I'd move this sort of thing to content tests, though I understand the rationale of leaving it here for now to speed up build times
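If this does move to content tests, a stricter variant could check each kernel flavor by name instead of counting `linux-image` lines, since the count also includes meta packages and packages marked `deinstall`. The sketch below is only an illustration: `has_both_kernel_flavors` is a hypothetical helper, and the package-name patterns (`linux-image-<version>-azure` and `linux-image-<version>-azure-nvidia`) are assumptions about how the kernel images are named.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: read `dpkg --get-selections` output on stdin and
# succeed only if both an -azure and an -azure-nvidia kernel image are
# selected for install. The package-name patterns are assumptions.

has_both_kernel_flavors() {
    local pkg state azure=0 nvidia=0
    while read -r pkg state; do
        # Ignore packages that are not in the "install" state.
        [ "$state" = "install" ] || continue
        case "$pkg" in
            linux-image-*-azure-nvidia*) nvidia=1 ;;
            linux-image-*-azure*)        azure=1 ;;
        esac
    done
    [ "$azure" -eq 1 ] && [ "$nvidia" -eq 1 ]
}
```

Usage would be along the lines of `dpkg --get-selections | has_both_kernel_flavors || exit 1`.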
@@ -0,0 +1,46 @@
{
Do you plan to manually maintain these versions? Or would you like to auto-update with Renovate? I assume manual given the difficulty in obtaining quota to test post updates.
This is done because the image requires very specific versions of packages. This isn't meant to be a broadly useful image, despite the gb200 in the name implying something generic.
echo "Generating non-GPU containerd config for GPU node due to VM tags"
echo "${CONTAINERD_CONFIG_NO_GPU_CONTENT}" | base64 -d > /etc/containerd/config.toml || exit $ERR_FILE_WATCH_TIMEOUT

if grep -q 'BinaryName = "/usr/bin/nvidia-container-runtime"' /etc/containerd/config.toml 2>/dev/null; then
Should there be an explicit GB200/300 check here as well?
This happens at runtime, not build time, and I don't believe there's a way to check this at that point. The feature flag isn't present then; you could perhaps check the SKU via IMDS, but that risks failing whenever IMDS is unreachable.
Fair. Could also consider adding a sentinel flag during build time, such as /etc/aks/gpu-config-baked.marker, which can be checked here. Just an optional suggestion
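A rough sketch of that sentinel idea, assuming the marker path suggested above. The function names and the overridable `GPU_CONFIG_MARKER` variable are hypothetical and not part of the PR; the `grep` pattern is the one from the diff.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a build-time sentinel plus a runtime check.
# The marker path follows the suggestion in this thread; nothing here
# exists in the PR as written.

GPU_CONFIG_MARKER="${GPU_CONFIG_MARKER:-/etc/aks/gpu-config-baked.marker}"

# Build time: record that the GPU containerd config was baked into the VHD.
write_gpu_config_marker() {
    mkdir -p "$(dirname "$GPU_CONFIG_MARKER")" && touch "$GPU_CONFIG_MARKER"
}

# Runtime: succeed only if the marker exists AND the containerd config
# still references the NVIDIA container runtime.
gpu_config_is_baked() {
    local config="${1:-/etc/containerd/config.toml}"
    [ -f "$GPU_CONFIG_MARKER" ] &&
        grep -q 'BinaryName = "/usr/bin/nvidia-container-runtime"' "$config" 2>/dev/null
}
```

This keeps the runtime check local to the filesystem, so it does not depend on IMDS being reachable.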
fi
else
    # However, for the 24.04 ARM images, we MUST have both -azure and -azure-nvidia kernels, so that we can run on either vanilla ARM64 hardware or GB200.
    if [ $(dpkg --get-selections | grep -c "linux-image") -lt 2 ]; then
Would it be worth including grep -q "GB200" <<<"$FEATURE_FLAGS" for future safety, when there's no dual booting?
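One way the suggested guard could be combined with the existing count check is sketched below. `has_gb200_flag` and `check_dual_kernels` are hypothetical helpers, and passing the flag string as an argument (rather than reading `$FEATURE_FLAGS` directly) is a choice made here only to keep the sketch easy to exercise.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: enforce the two-kernel requirement only when the
# feature-flag string mentions GB200.

has_gb200_flag() {
    grep -q "GB200" <<<"${1:-}"
}

# Reads `dpkg --get-selections` output on stdin; $1 is the feature-flag string.
check_dual_kernels() {
    local flags="$1" count
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    count=$(grep -c "linux-image" || true)
    if has_gb200_flag "$flags" && [ "$count" -lt 2 ]; then
        echo "ERROR: expected both the -azure and -azure-nvidia kernels, found ${count} linux-image package(s)" >&2
        return 1
    fi
    return 0
}
```

In the build script this might be called as `dpkg --get-selections | check_dual_kernels "$FEATURE_FLAGS" || exit 1`.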
This change pulls the commits specific to the release-gb200 branch into main, rather than trying to merge commits from main into release-gb200.

I've modified vhdbuilder/packer/pre-install-dependencies.sh from the release-gb200 branch so that the NVIDIA kernel path from the PPA depends on the presence of the GB200 feature flag.