
feat(nvidia): Vulkan vGPU support — memory partitioning via vulkan-layer + manifest auto-mount + admission webhook env injection#1803

Open
100milliongold wants to merge 46 commits into Project-HAMi:master from xiilab:feat/vulkan-vgpu


@100milliongold

Summary

Adds Vulkan vGPU support to HAMi so that Vulkan workloads (Isaac Sim, ray tracing, GPU-accelerated rendering, etc.) honor the same per-container memory limit that HAMi already enforces for CUDA.

Three coordinated layers, all opt-in via the hami.io/vulkan: "true" pod annotation:

  1. HAMi-core Vulkan layer — hooks vkAllocateMemory to enforce CUDA_DEVICE_MEMORY_LIMIT_0. The Vulkan implicit-layer manifest uses enable_environment: HAMI_VULKAN_ENABLE=1 so the layer is loaded only when the pod opts in.
  2. Device-plugin — bind-mounts the Vulkan implicit-layer manifest from the host into the container at /etc/vulkan/implicit_layer.d/hami.json.
  3. Admission webhook — when the pod has hami.io/vulkan: "true" and requests a GPU resource, injects HAMI_VULKAN_ENABLE=1 and merges graphics into NVIDIA_DRIVER_CAPABILITIES so the runtime exposes the NVIDIA Vulkan ICD.

Pods without the annotation are unaffected — bit-identical to current behavior.

Why

HAMi currently enforces vGPU memory limits only for CUDA workloads. Vulkan applications bypass the limit because vkAllocateMemory is not on the CUDA path. We hit this in production with Isaac Sim: Kit allocated memory through Vulkan, ignored the requested partition, and OOM'd the host. Hooking vkAllocateMemory in the HAMi-core layer closes the gap.

The opt-in annotation is intentional:

  • Avoids forcing the graphics capability on every CUDA pod (would change runtime behavior cluster-wide).
  • The enable_environment guard keeps the layer from loading when the env is not set, even if the manifest happens to be mounted.
  • Existing CUDA workloads remain unchanged.
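
To make the gating concrete, here is a minimal Python sketch of how an implicit layer's environment guards could decide activation. It is a simplified model, not the Vulkan loader's code: the exact matching rule (value equality here) is an assumption, and the disable variable name `HAMI_VULKAN_DISABLE` is hypothetical because the PR does not state it.

```python
from typing import Mapping

# Only the fields relevant to gating. HAMI_VULKAN_DISABLE is a hypothetical
# name used for illustration; the PR does not name the disable variable.
HAMI_LAYER = {
    "name": "VK_LAYER_HAMI_vgpu",
    "enable_environment": {"HAMI_VULKAN_ENABLE": "1"},
    "disable_environment": {"HAMI_VULKAN_DISABLE": "1"},
}


def layer_is_active(layer: Mapping, env: Mapping[str, str]) -> bool:
    """Simplified model of implicit-layer gating (assumed semantics)."""
    # A disable variable that is set to any non-empty value always wins.
    for name in layer.get("disable_environment", {}):
        if env.get(name):
            return False
    enable = layer.get("enable_environment")
    if not enable:
        # No enable guard: an implicit layer would load for every process.
        return True
    # With an enable guard, the layer loads only when the variable matches.
    return any(env.get(name) == value for name, value in enable.items())


if __name__ == "__main__":
    print(layer_is_active(HAMI_LAYER, {}))                           # pod without the annotation
    print(layer_is_active(HAMI_LAYER, {"HAMI_VULKAN_ENABLE": "1"}))  # opted-in pod
```

The point of the model is the second bullet above: a mounted manifest alone does nothing until the webhook sets the gating variable.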

What changed

| File / Area | Change |
| --- | --- |
| libvgpu submodule | xiilab/HAMi-core@8d4f712 (Vulkan layer with vkAllocateMemory hook, cuMemFree[Async] untracked-pointer fallback, cuMemGetInfo_v2 OptiX crash fix). A companion PR against Project-HAMi/HAMi-core is needed before this can be merged upstream; happy to open it once direction is agreed. |
| docker/Dockerfile | Install libvulkan-dev in the nvbuild stage; copy hami.json into the runtime image at /k8s-vgpu/lib/nvidia/vulkan/implicit_layer.d/hami.json |
| pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go | When /usr/local/vgpu/vulkan/implicit_layer.d/hami.json exists on the host, append a bind-mount into the container's Allocate response. Idempotent and side-effect-free when the file is absent. |
| pkg/device/nvidia/device.go | New applyVulkanAnnotation: when the pod carries hami.io/vulkan: "true", sets HAMI_VULKAN_ENABLE=1 and merges graphics into NVIDIA_DRIVER_CAPABILITIES. Called only when the container actually requests a GPU resource. |
| pkg/device/nvidia/device_test.go | TDD coverage for env injection: no-annotation no-op, capability merge, idempotency, edge cases (existing caps, empty caps, etc.). |
| examples/nvidia/vulkan_example.yaml | Minimal usage sample. |
| docs/vulkan-vgpu-support.md | English usage guide. |
| docs/vulkan-vgpu-support_cn.md | Chinese translation. |
| docs/vulkan-vgpu-e2e-checklist.md | Manual E2E verification checklist. |

How it works

  1. The HAMi-core Vulkan layer hooks vkAllocateMemory to enforce the per-container memory limit set by HAMi-core's existing CUDA limit code (same CUDA_DEVICE_MEMORY_LIMIT_0 env).
  2. The device-plugin mounts the implicit-layer manifest into the container so the Vulkan loader picks it up automatically.
  3. The manifest's enable_environment: HAMI_VULKAN_ENABLE=1 guard means the layer isn't activated unless the env is set.
  4. The admission webhook reads hami.io/vulkan: "true", sets the gating env, and merges graphics so the NVIDIA runtime exposes the Vulkan ICD libraries.
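
Step 4 can be sketched as follows. This is an illustrative Python model of the env injection, not the Go code in pkg/device/nvidia/device.go; the function names are made up for the example, and the fallback used when NVIDIA_DRIVER_CAPABILITIES is empty (start from `compute,utility`) is an assumption about the desired behavior rather than something the PR states.

```python
from typing import Dict, Mapping

VULKAN_ANNOTATION = "hami.io/vulkan"


def merge_graphics_cap(current: str) -> str:
    """Return the capability list with 'graphics' present exactly once."""
    caps = [c.strip() for c in current.split(",") if c.strip()]
    if not caps:
        # Assumption: keep the usual compute/utility defaults and add graphics.
        return "compute,utility,graphics"
    if "all" in caps or "graphics" in caps:
        # Already covered; leave the value untouched so the call is idempotent.
        return ",".join(caps)
    return ",".join(caps + ["graphics"])


def apply_vulkan_annotation(
    annotations: Mapping[str, str],
    env: Mapping[str, str],
    requests_gpu: bool,
) -> Dict[str, str]:
    """Return the container env after the (modelled) webhook mutation."""
    out = dict(env)
    if not requests_gpu or annotations.get(VULKAN_ANNOTATION) != "true":
        return out  # pods that did not opt in are left unchanged
    out["HAMI_VULKAN_ENABLE"] = "1"
    out["NVIDIA_DRIVER_CAPABILITIES"] = merge_graphics_cap(
        out.get("NVIDIA_DRIVER_CAPABILITIES", "")
    )
    return out


if __name__ == "__main__":
    print(apply_vulkan_annotation({"hami.io/vulkan": "true"},
                                  {"NVIDIA_DRIVER_CAPABILITIES": "compute,utility"},
                                  requests_gpu=True))
```

Returning a copy rather than mutating the input keeps the no-annotation path trivially side-effect-free, which mirrors the regression check in the test plan.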

Test plan

  • go test ./pkg/device/nvidia/... — env injection unit tests pass.
  • make docker builds the image with libvulkan-dev and ships hami.json.
  • Deploy pod with hami.io/vulkan: "true" annotation → HAMI_VULKAN_ENABLE=1 env present, NVIDIA_DRIVER_CAPABILITIES contains graphics, /etc/vulkan/implicit_layer.d/hami.json mounted.
  • Deploy pod without the annotation → unmodified (regression check).
  • E2E: ran a Vulkan workload (Isaac Sim) with nvidia.com/gpumem limit; the Kit boot log reports the exact partition size and the workload is held to it.

Verified on

  • Hardware: NVIDIA RTX 6000 Ada × 2 (driver 550-series).
  • K8s: v1.34.3.
  • Integration: tested with both stock HAMi (3-tier: webhook + scheduler + device-plugin) and a webhook-only deployment co-existing with Volcano scheduler. The Vulkan changes are orthogonal to scheduling — they only depend on the webhook + device-plugin path.

Compatibility / Breaking changes

  • None for existing CUDA workloads — the Vulkan code paths are gated behind the annotation and the enable_environment runtime guard.
  • New container env (HAMI_VULKAN_ENABLE) and new mount path (/etc/vulkan/implicit_layer.d/hami.json) are added only for opted-in pods.

Notes for reviewers

  • The submodule change to xiilab/HAMi-core@vulkan-layer is the only blocker for upstream merge; happy to open the companion PR against Project-HAMi/HAMi-core once direction is agreed (e.g. accept as a new branch, fold into a release, or restructure as a build flag).
  • This branch carries a few internal planning files under docs/superpowers/ (Korean-language design and implementation plans) that I can drop in a cleanup commit if reviewers prefer a leaner diff.
  • The webhook code at pkg/scheduler/webhook.go:64-69 (the existing scheduler-name skip check) has a known operator-precedence issue in v2.8.x that #1163 ("fix: Add option for overwrite schedulerName") fixed on master; this PR does not touch that block.
  • The device-plugin change is intentionally null-safe: if the host doesn't have hami.json (e.g. the user opted out of running the manifest installer), the Allocate response is unchanged.

Happy to split into smaller PRs (HAMi-core layer / Dockerfile / device-plugin mount / webhook env / docs) if that's easier to review.

@hami-robot hami-robot Bot requested a review from archlitchi April 27, 2026 05:20
@hami-robot hami-robot Bot requested a review from wawa0210 April 27, 2026 05:20
@github-actions github-actions Bot added the kind/feature new function label Apr 27, 2026
@hami-robot
Contributor

hami-robot Bot commented Apr 27, 2026

Welcome @100milliongold! It looks like this is your first PR to Project-HAMi/HAMi 🎉

@hami-robot hami-robot Bot added the size/XXL label Apr 27, 2026
Adds design spec for extending HAMi's NVIDIA vGPU partitioning
to Vulkan workloads. CUDA and Vulkan share the existing
nvidia.com/gpumem and nvidia.com/gpucores budgets, gated by
the hami.io/vulkan annotation. Interception is implemented
as a Vulkan implicit layer exposed by libvgpu.so (HAMi-core),
sharing in-process counters with the existing CUDA hooks.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…ation

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Local submodule bump pointing at the vulkan-layer branch HEAD 579a421
(see docs/superpowers/plans/notes/hami-core-vulkan-sha.txt). Will be
re-pointed to merged HAMi-core main once the upstream PR lands.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Includes NVML UUID-based physdev_index, VK_VERSION_1_3 guards,
CUDA shim bootstrap for Vulkan-only apps, and all Vulkan layer
work from the vulkan-layer branch.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Mirrors the HAMi-core vulkan-layer branch at
https://github.com/xiilab/HAMi-core.git so fresh clones of
xiilab/HAMi can resolve the submodule without access to a
local-only commit.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Device-plugin already mounts libvgpu.so and /etc/ld.so.preload, but the
HAMi Vulkan layer manifest was missing from user containers -- pods that
opted into Vulkan partitioning had to write the manifest manually.

Ships etc/vulkan/implicit_layer.d/hami.json from the libvgpu submodule
into the HAMi image at /k8s-vgpu/lib/nvidia/vulkan/implicit_layer.d/,
so vgpu-init.sh (recursive copy) lands it on the host. Allocate then
bind-mounts the host file at /etc/vulkan/implicit_layer.d/hami.json in
the user container when present (skipped if absent to avoid blocking
pod startup on nodes not yet populated).

With this, pods carrying hami.io/vulkan="true" annotation get the
Vulkan layer activated automatically via HAMI_VULKAN_ENABLE=1 env var
injected by the webhook.
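
The "mount only when present" behavior described above could look roughly like this. It is a Python sketch of the idea, not the device-plugin's Go code: the mount dictionary shape, the `read_only` flag, and the injectable `exists` predicate are assumptions made for illustration; only the two paths come from the PR text.

```python
import os
from typing import Callable, Dict, List

HOST_MANIFEST = "/usr/local/vgpu/vulkan/implicit_layer.d/hami.json"
CONTAINER_MANIFEST = "/etc/vulkan/implicit_layer.d/hami.json"

Mount = Dict[str, object]


def append_vulkan_manifest_mount(
    mounts: List[Mount],
    host_path: str = HOST_MANIFEST,
    exists: Callable[[str], bool] = os.path.isfile,
) -> List[Mount]:
    """Return the mount list, adding the manifest bind-mount if the host has it."""
    result = list(mounts)
    if not exists(host_path):
        # Node not populated yet: leave the response unchanged so the pod
        # still starts.
        return result
    if any(m["container_path"] == CONTAINER_MANIFEST for m in result):
        return result  # already present, keep the call idempotent
    result.append({
        "container_path": CONTAINER_MANIFEST,
        "host_path": host_path,
        "read_only": True,  # assumption: the manifest never needs to be writable
    })
    return result


if __name__ == "__main__":
    print(append_vulkan_manifest_mount([]))
```

Passing the existence check in as a parameter is only a convenience for testing the sketch without touching the filesystem.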

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…lback)

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…ice-plugin application plan)

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces Vulkan vGPU partitioning support by updating the libvgpu submodule, configuring Docker images with Vulkan dependencies, and implementing admission webhook logic to inject necessary environment variables and manifest mounts. The changes also include comprehensive design specifications and user documentation. Review feedback suggests expanding manifest mounting to support MIG devices, improving the robustness of the graphics capability merging logic, and optimizing environment variable processing in the admission controller for better efficiency.

Comment thread pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go Outdated
Comment thread pkg/device/nvidia/device.go
Comment thread pkg/device/nvidia/device.go
8 tasks adding NULL pointer guards to 7 CUDA hooks in HAMi-core fork
(cuMemAlloc_v2, cuMemAllocHost_v2, cuMemAllocManaged, cuMemAllocPitch_v2,
cuMemHostAlloc, cuMemHostRegister_v2, cuCtxGetDevice). Pattern follows
commit 03f99d7 (cuMemGetInfo_v2): forward to driver first → NULL/invalid
arg early return → HAMi enforcement. Reduces SegFault risk for callers
(Isaac Sim Kit OptiX/Aftermath/Carbonite) that pass NULL during internal
probes under LD_PRELOAD=/usr/local/vgpu/libvgpu.so.

Spec: docs/superpowers/specs/2026-04-28-hami-isolation-isaac-sim-design.md
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Pulls in 5 commits adding NULL ptr guards / cleanup to:
- cuMemAlloc_v2 (88143ab)
- cuMemAllocManaged (275ba3d)
- cuMemAllocPitch_v2 (01a58f1)
- cuMemHostRegister_v2 — drop vestigial cuCtxGetDevice + NULL guard (7dcb5a4)
- audit notes (7b76d9b)

Pattern matches cuMemGetInfo_v2 (commit 03f99d7). Reduces SegFault
risk for callers (Isaac Sim Kit OptiX/Aftermath/Carbonite) that pass
NULL during internal probes under LD_PRELOAD=/usr/local/vgpu/libvgpu.so.

Spec: docs/superpowers/specs/2026-04-28-hami-isolation-isaac-sim-design.md
Plan: docs/superpowers/plans/2026-04-28-hami-isolation-step-b-cuda-hook-hardening.md
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
@100milliongold
Author

Step B (HAMi-core hook hardening) complete

HAMi-core PR #182 added NULL pointer guards to CUDA hooks:

  • cuMemAlloc_v2 (88143ab)
  • cuMemAllocManaged (275ba3d)
  • cuMemAllocPitch_v2 (01a58f1)
  • cuMemHostRegister_v2 (7dcb5a4) — also drops vestigial cuCtxGetDevice
  • cuMemHostAlloc / cuCtxGetDevice — tests-only (already correct)

Pattern matches the existing cuMemGetInfo_v2 fix (commit 03f99d7).

The libvgpu submodule pointer is bumped to the new HAMi-core SHA (commit 8cfcebb).

isaac-launchable baseline preserved (5/5 runheadless.sh alive). 9/9 unit tests pass + 4/4 manual NULL stress tests in pod under LD_PRELOAD pass without crash.

Step C (Vulkan layer compat for Isaac Sim Kit init under LD_PRELOAD) follows in a separate plan.

Spec: docs/superpowers/specs/2026-04-28-hami-isolation-isaac-sim-design.md
Plan: docs/superpowers/plans/2026-04-28-hami-isolation-step-b-cuda-hook-hardening.md

7 tasks for HAMi-core Vulkan layer (libvgpu/src/vulkan/) Isaac Sim Kit
init compat: (1) commit unstaged WIP foundation (Enumerate via dispatch
+ first-gipa cache + hami_instance_first helper), (2) GIPA/GDPA
cached-fallback for unknown instance/device handles, (3) HAMI_VK_TRACE
collect actual lookup names during runheadless.sh, (4) evidence-driven
explicit hooks for vkGetPhysicalDevice* names that returned NULL,
(5) dispatch lifetime + chain pLayerInfo deep-copy audit (review-only),
(6) integration verify on ws-node074 with LD_PRELOAD forced (5/5
runheadless alive), (7) push + draft PR comments.

Evidence-driven approach: Tasks 3-4 collect trace data first, then
patch only what proved to break. No speculative hardening. Step D
(isaac-launchable opt-in activation + 4-path verification) is a
separate plan.

Spec: docs/superpowers/specs/2026-04-28-hami-isolation-isaac-sim-design.md §9
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
2026-04-28: the first Step C attempt regressed on ws-node074. The evidence is
preserved in libvgpu/docs/superpowers/notes/2026-04-28-vk-trace-isaac-sim.md.

New architecture:
* libvgpu.so = HAMi-core (NVML/CUDA hook + allocator + multiprocess).
  Excludes vulkan_mod. Exports no vk* symbols. Only the budget interface is
  explicitly exported as hami_core_*.
* libvgpu_vk.so = Vulkan layer (all of src/vulkan/*). Activated only through
  the manifest. Resolves the budget functions at link time via DT_NEEDED libvgpu.so.

Isolation property: with LD_PRELOAD libvgpu.so alone there are 0 Vulkan symbols →
the LD_PRELOAD-only crash class found on 4-28 becomes structurally impossible.
The Vulkan layer is activated only through the manifest dlopen path, which
guarantees normal chain entry.

Verification: local docker (nm/readelf + unit test) + ws-node074 (LD_PRELOAD-only
runheadless × 5 alive + partition clamp when the manifest is active).

Out of scope: reintroducing Tasks 1+2 (cache + GIPA fallback) is a separate phase
after the new architecture passes verification. webhook / namespace label belong to Step A/D scope.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
10 tasks for splitting Vulkan layer into libvgpu_vk.so:

(1) add hami_core_* export wrappers (header + impl)
(2) rewrite src/vulkan/budget.c + throttle_adapter.c to call wrappers
(3) pre-split sanity build (combined .so still healthy)
(4) split CMake — libvgpu_vk.so + libvgpu.so loses vulkan_mod
(5) ELF / symbol diff verification (the structural-isolation proof)
(6) unit tests against split build
(7) ship hami.json implicit-layer manifest
(8) ws-node074 LD_PRELOAD-only smoke (regression-killed proof)
(9) ws-node074 manifest-activated smoke (layer doing its job)
(10) push fork + bump submodule + draft PR comments

Production safety gate at Task 8: backup before swap, baseline runheadless
check after swap, restore on any anomaly. Manifest install (Task 9) only
if Task 8 5/5 alive — confirms LD_PRELOAD-only regression class is gone.

Spec: docs/superpowers/specs/2026-04-29-step-c-redesign-vk-so-split.md
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Pulls in the Step C redesign: Vulkan layer code is now a separate
libvgpu_vk.so, activated by /etc/vulkan/implicit_layer.d/hami.json.
libvgpu.so retains only HAMi-core (NVML/CUDA hooks + allocator +
multiprocess) and loses all vk* exports.

Verified on ws-node074 isaac-launchable-0:
* LD_PRELOAD libvgpu.so without manifest: 5/5 runheadless exit=124 alive
  crash=0 — the 2026-04-28 regression class is structurally gone.
* LD_PRELOAD libvgpu.so + hami.json manifest installed in pod:
  5/5 alive, NVML clamp 44 GiB → 23 GiB. Vulkan loader enumerated the
  hami manifest. Trace lines remained 0 in this verification pass —
  Kit's runheadless path under the embedded Conan vulkan-loader did not
  invoke our GIPA wrappers in this run; functional partition enforce
  on the Vulkan path will be confirmed in Step D's 4-path verification.
* Production .so restored to backup md5 8f889313 after verification.

Spec: docs/superpowers/specs/2026-04-29-step-c-redesign-vk-so-split.md
Plan: docs/superpowers/plans/2026-04-29-step-c-vk-so-split.md
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…ath verification

Demonstrates that the Step C deliverables (libvgpu.so + libvgpu_vk.so split,
hami.json INSTANCE manifest) actually activate on the production opt-in path
and that partition enforcement works on all 4 paths.

Key decisions:
* volcano-vgpu-device-plugin image rebuild (vulkan-v2): the postStart
  lifecycle installs both the new libvgpu.so and libvgpu_vk.so onto the host.
* hami-vulkan-manifest CM update: library_path → libvgpu_vk.so,
  type → INSTANCE, enable_environment HAMI_VULKAN_ENABLE kept.
* hami-vulkan-manifest-installer DS re-enabled (nodeSelector restored).
* webhook applyVulkanAnnotation code unchanged: when the annotation hami.io/vulkan
  is true, it injects HAMI_VULKAN_ENABLE + the graphics capability.
* 4-path verification: NVML clamp / CUDA mem_get_info clamp / Vulkan
  memoryHeaps clamp / Vulkan allocate over-budget OOM.

Production safety: backup at each step + one post-step baseline runheadless
check + immediate rollback on failure.

Out of scope: helm chart integration, reintroducing Tasks 1+2, multi-GPU cases.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
9 tasks for productionizing the Step C libvgpu_vk.so split:

(1) inventory current production state + baseline
(2) build & push volcano-vgpu-device-plugin:vulkan-v2 (libvgpu submodule bump)
(3) update hami-vulkan-manifest CM (library_path → libvgpu_vk.so, type → INSTANCE)
(4) re-enable hami-vulkan-manifest-installer DS (restore nodeSelector)
(5) bump volcano-device-plugin DS image
(6) annotate isaac-launchable-0 + restart + initial activation verify
(7) 4-path verification (NVML / CUDA / Vulkan memory / Vulkan allocate)
(8) HAMI_VK_TRACE host-loader probe + sanity check other Vulkan pods
(9) push snapshot YAMLs + draft PR comments

Production safety: each Task ends with one post-step baseline runheadless
run plus immediate rollback on regression. Manifest activation is gated by
enable_environment HAMI_VULKAN_ENABLE=1: without the webhook injecting
this env (annotation absent), the layer stays inert and other pods are
unaffected.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
library_path = /usr/local/vgpu/libvgpu_vk.so (Step C split target).
type = INSTANCE (per spec; matches single-instance Vulkan layer
contract instead of the deprecated GLOBAL).

enable_environment HAMI_VULKAN_ENABLE=1 unchanged — opt-in trigger
flows through the existing webhook applyVulkanAnnotation.
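
For orientation, a manifest with the fields named in this commit can be generated as below. This is a hedged sketch, not the contents of the shipped hami.json: `api_version`, `implementation_version`, `description`, and the disable variable name `HAMI_VULKAN_DISABLE` are placeholder assumptions; `library_path`, `type`, the layer name, and the enable variable are taken from this thread.

```python
import json
from typing import Dict


def build_hami_manifest(
    library_path: str = "/usr/local/vgpu/libvgpu_vk.so",
) -> Dict[str, object]:
    """Build an implicit-layer manifest dictionary for the HAMi Vulkan layer."""
    return {
        "file_format_version": "1.0.0",
        "layer": {
            "name": "VK_LAYER_HAMI_vgpu",
            "type": "INSTANCE",
            "library_path": library_path,
            # Placeholder values below; the real manifest may differ.
            "api_version": "1.3.0",
            "implementation_version": "1",
            "description": "HAMi vGPU memory partitioning layer",
            # Opt-in trigger injected by the webhook for annotated pods.
            "enable_environment": {"HAMI_VULKAN_ENABLE": "1"},
            # Hypothetical variable name, shown only to complete the shape.
            "disable_environment": {"HAMI_VULKAN_DISABLE": "1"},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_hami_manifest(), indent=2))
```

Generating the file from one function keeps the ConfigMap content and any test fixtures from drifting apart, if a project chooses to manage it that way.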

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Extends applyVulkanAnnotation to also inject a hostPath volume that
mounts /etc/vulkan/implicit_layer.d/hami.json from the host into the
container at the same path. This makes the Vulkan loader inside the
pod discover our implicit-layer manifest, which is required for
libvgpu_vk.so (Step C split) to actually enter the chain.

Why a hostPath volume instead of a ConfigMap volume: ConfigMap volumes
must reference a CM in the pod's own namespace; hostPath sidesteps
cross-namespace replication. The host file is kept in sync by
hami-vulkan-manifest-installer DaemonSet (Step D Task 4), which writes
the ConfigMap content to /etc/vulkan/implicit_layer.d/hami.json on
every GPU node.

The volume is added once per pod (idempotent across multi-container
pods); the volumeMount is added once per opt-in container. Pods that
don't carry the hami.io/vulkan annotation see no volume/mount changes.
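
The once-per-pod volume and once-per-container mount rule can be illustrated with plain dictionaries standing in for the pod spec. This is a Python sketch under stated assumptions, not the webhook's Go code: the volume name `hami-vulkan-manifest`, the helper names, and the rule that "opt-in container" means any container with an `nvidia.com/` resource limit are all hypothetical.

```python
from typing import Dict

MANIFEST_PATH = "/etc/vulkan/implicit_layer.d/hami.json"
VOLUME_NAME = "hami-vulkan-manifest"  # hypothetical volume name


def ensure_host_path_file_volume(spec: Dict, name: str, path: str) -> None:
    """Add a hostPath file volume to the pod spec unless it already exists."""
    volumes = spec.setdefault("volumes", [])
    if not any(v["name"] == name for v in volumes):
        volumes.append({"name": name, "hostPath": {"path": path, "type": "File"}})


def ensure_host_path_file_volume_mount(container: Dict, name: str, path: str) -> None:
    """Add a read-only mount to the container unless it already exists."""
    mounts = container.setdefault("volumeMounts", [])
    if not any(m["name"] == name for m in mounts):
        mounts.append({"name": name, "mountPath": path, "readOnly": True})


def requests_gpu(container: Dict) -> bool:
    # Assumption: a container opts in when it has any nvidia.com/* limit.
    limits = container.get("resources", {}).get("limits", {})
    return any(key.startswith("nvidia.com/") for key in limits)


def inject_vulkan_manifest(pod: Dict) -> None:
    """Mutate the pod in place; pods without the annotation are untouched."""
    annotations = pod.get("metadata", {}).get("annotations", {})
    if annotations.get("hami.io/vulkan") != "true":
        return
    spec = pod["spec"]
    for container in spec.get("containers", []):
        if requests_gpu(container):
            ensure_host_path_file_volume(spec, VOLUME_NAME, MANIFEST_PATH)
            ensure_host_path_file_volume_mount(container, VOLUME_NAME, MANIFEST_PATH)
```

Because both helpers check by name before appending, running the mutation twice (for example on an admission retry) leaves the pod unchanged the second time.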

Tests: 3 new test cases covering volume+mount injection, idempotency
across multiple containers, and no-op when annotation is absent.
All 9 webhook test cases pass.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
libvgpu/build.sh runs 'git describe --abbrev=0 --tags' under set -e to
populate CI_COMMIT_SHA. Inside the nvbuild Docker stage there is no
.git/modules/libvgpu (only source files are COPYed in), so git describe
exits non-zero and set -e aborts the script. Pre-setting CI_COMMIT_SHA
skips the git block entirely.

Also lands two Step D runtime snapshot files:
- hami-vulkan-manifest-installer-ds.yaml: nodeSelector → nvidia.com/gpu.present, install path → /etc/vulkan/implicit_layer.d/hami.json
- volcano-device-plugin-ds.yaml: image vulkan-v1 → vulkan-v2

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Volcano-vgpu-device-plugin's per-pod Allocate response mounts /usr/local/
vgpu/libvgpu.so but not libvgpu_vk.so. Without libvgpu_vk.so visible
in the container at the path the implicit-layer manifest references
(library_path: /usr/local/vgpu/libvgpu_vk.so), the Vulkan loader logs
"Requested layer VK_LAYER_HAMI_vgpu failed to load: cannot open shared
object file" and falls back to no layer.

Extend applyVulkanAnnotation to add a second hostPath volume + mount
for libvgpu_vk.so. Refactor the previous manifest-injection helpers
into generic ensureHostPathFileVolume / ensureHostPathFileVolumeMount
so adding more host-file injections later stays cheap.

The libvgpu_vk.so mount is opt-in via the same hami.io/vulkan: "true"
annotation as the manifest mount — pods without the annotation see
neither change.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
… in split arch

Cherry-picks 996cb22 + eea2beb (originally Step C Tasks 1+2; reverted
as 83fd245 + f52aada due to LD_PRELOAD-only regression on the
combined libvgpu.so build). With the Step C redesign, libvgpu_vk.so is
manifest-activated only — the ICD-init regression scenario is gone, so
the EnumerateDevice* hooks + GIPA fallback can return safely.

Path 4 (vkAllocateMemory over-budget) of the Step D 4-path verification
required these hooks: NVIDIA's vkEnumerateDeviceExtensionProperties
returns LayerNotPresent when the loader queries with our layer name,
breaking vkCreateDevice in the layer chain. Re-adding our HAMI_HOOK
for that name, together with the dispatch resolution, makes own-name
queries return 0 entries (correct) and forwards all other queries.

Includes the Step D snapshot files committed earlier:
- hami-vulkan-manifest-cm.yaml: type INSTANCE, library_path
  /usr/local/vgpu/libvgpu_vk.so, enable_environment +
  disable_environment (the Vulkan loader spec requires disable_environment
  for implicit layers; enable_environment is the optional opt-in gate).
- vk_partition_test.py: 4-path verification probe.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
@100milliongold
Author

Step C — Vulkan layer split (libvgpu_vk.so)

HAMi-core PR #182 redesigned Step C: libvgpu.so is now HAMi-core only, and a new libvgpu_vk.so holds the Vulkan implicit layer. Activation moves entirely to the manifest path, removing the LD_PRELOAD/Vulkan-loader collision surface that bit us on 2026-04-28 (see docs/superpowers/notes/2026-04-28-vk-trace-isaac-sim.md in the submodule).

The libvgpu submodule pointer is bumped from 7dcb5a4 (Step B end) to 65930f4 (Step C redesign end).

Verification on ws-node074, isaac-launchable-0

  • LD_PRELOAD libvgpu.so without manifest: 5/5 runheadless.sh exit=124 alive, crash=0, listen :49100 — the regression class introduced by the 2026-04-28 first attempt is structurally gone in this architecture.
  • LD_PRELOAD libvgpu.so + hami.json manifest installed in pod: 5/5 alive, manifest enumerated by Vulkan loader, NVML hook clamp 44 GiB → 23 GiB confirmed via nvidia-smi.
  • HAMI_VK_TRACE instrumentation showed 0 lines on the manifest path during this run; the embedded Conan Vulkan loader inside Kit's runheadless did not invoke our GIPA wrappers in this test. Partition enforcement on the Vulkan path will be confirmed in Step D's 4-path verification once the webhook installs the manifest at the host level alongside libvgpu_vk.so.
  • Production /usr/local/vgpu/libvgpu.so restored to backup md5 8f889313 after verification — pod baseline exit=124 crash=0 listen=1 confirmed post-restore.

What's next

  • Step D (separate plan) will productionize the activation: webhook-injected LD_PRELOAD env + DaemonSet-installed hami.json + host-installed libvgpu_vk.so, plus the 4-path partition-enforcement verification (NVML, CUDA, Vulkan memory query, Vulkan allocate).
  • The original Step C "Tasks 1+2" (cache first next-gipa, GIPA/GDPA fallback for unknown handles) stay reverted and become a follow-up PR after Step D end-to-end verification; the trace evidence already on the submodule branch is enough background for that follow-up.

Spec / Plan

  • Spec: docs/superpowers/specs/2026-04-29-step-c-redesign-vk-so-split.md
  • Plan: docs/superpowers/plans/2026-04-29-step-c-vk-so-split.md

@100milliongold
Author

Step D: Vulkan opt-in production activation + 4-path verification

Activates the Step C libvgpu_vk.so split deliverables on the production opt-in path. Partition enforcement was verified on all 4 paths (NVML / CUDA / Vulkan-memory-query / Vulkan-allocate) on ws-node074 isaac-launchable-0.

Commits (HAMi parent, this push)

| SHA | Message |
| --- | --- |
| 9f06dc0 | chore(runtime): Step D — update hami-vulkan-manifest CM to libvgpu_vk.so |
| c5d4b89 | feat(nvidia): inject Vulkan manifest hostPath mount on opt-in pods |
| 8542d63 | build(docker): set CI_COMMIT_SHA in nvbuild stage to skip git describe |
| c15889a | feat(nvidia): also inject libvgpu_vk.so hostPath into Vulkan opt-in pods |
| a55a5a5 | chore(libvgpu): bump HAMi-core for Step D — re-apply Step C Tasks 1+2 in split arch |

Companion changes

  • libvgpu fork (xiilab/vulkan-layer): cherry-picked Step C Tasks 1+2 (996cb22, eea2beb) on top of the previously-pushed Step C redesign — they were originally reverted due to LD_PRELOAD-only regression on the combined-libvgpu.so build, but the Step C redesign moved Vulkan code into a separate libvgpu_vk.so that is manifest-activated only, so that regression scenario can no longer occur. Path 4 (vkAllocateMemory over-budget OOM) requires these hooks because NVIDIA's vkEnumerateDeviceExtensionProperties returns LayerNotPresent for our layer name and breaks vkCreateDevice chain assembly without them.
  • volcano-vgpu-device-plugin fork (xiilab/feat/vulkan-vgpu-support): libvgpu submodule bumped to vulkan-v2 build; image rebuilt and pushed as volcano-vgpu-device-plugin:vulkan-v2. (After this Step D PR merges, that image needs another bump to pick up the cherry-picked Tasks 1+2.)
  • HAMi parent helm chart upgrade: helm upgrade hami-webhook charts/hami -n hami-system --set global.imageTag=vulkan-v3 --set scheduler.extender.image.tag=vulkan-v3 --set scheduler.admissionWebhook.namespaceSelector.{mode=opt-in,matchLabels={},matchExpressions=[]} --set scheduler.admissionWebhook.objectSelector.{matchLabels={},matchExpressions=[]} --set scheduler.admissionWebhook.whitelistNamespaces=[] --set prometheus.enabled=false (revision 8 deployed).
  • Namespace label: kubectl label namespace isaac-launchable hami.io/vgpu=enabled (required by namespaceSelector.mode=opt-in).
  • Manifest installer DaemonSet nodeSelector patched to remove legacy hami.io/disabled=true so it actually schedules on GPU nodes.

Verification on ws-node074, isaac-launchable-0 (annotation hami.io/vulkan: "true" + webhook injection)

| Path | Method | Result |
| --- | --- | --- |
| 1 NVML | nvidia-smi --query-gpu=memory.total --format=csv,noheader | 23552 MiB (clamped from raw 46068 MiB) ✓ |
| 2 CUDA | pycuda.driver.mem_get_info() | SKIP (pycuda not installed in image; informational only, since the NVML hook covers the same code path via dlsym) |
| 3 Vulkan memory query | vkGetPhysicalDeviceMemoryProperties | device-local heap 24696061952 bytes (23552 MiB) clamp via dispatch ✓ |
| 4 Vulkan allocate | vkAllocateMemory(25 GiB) | VK_ERROR_OUT_OF_DEVICE_MEMORY (HAMi-core OOM logged: Device 0 OOM 26843545600 / 24696061952) ✓ |
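
The byte figures in paths 3 and 4 follow directly from the partition size, as this small Python check shows. The `check_allocation` function is an illustrative stand-in for the layer's budget test, not HAMi-core's actual logic; in particular, how current usage is tracked is an assumption.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Vulkan result codes as defined by the Vulkan specification.
VK_SUCCESS = 0
VK_ERROR_OUT_OF_DEVICE_MEMORY = -2


def check_allocation(requested: int, used: int, limit: int) -> int:
    """Reject an allocation that would push usage past the partition limit."""
    if used + requested > limit:
        return VK_ERROR_OUT_OF_DEVICE_MEMORY
    return VK_SUCCESS


if __name__ == "__main__":
    limit = 23552 * MIB      # partition size reported on path 1
    request = 25 * GIB       # over-budget allocation attempted on path 4
    print(limit)             # 24696061952
    print(request)           # 26843545600
    print(check_allocation(request, used=0, limit=limit))  # -2
```

The two printed byte values are the same pair that appears in the logged OOM line (requested / limit).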

HAMI_VK_TRACE lines: 93 (layer fully in chain — Insert instance layer "VK_LAYER_HAMI_vgpu" (/usr/local/vgpu/libvgpu_vk.so)).
runheadless.sh × 3: 3/3 alive exit=124 crash=0.

Sanity

  • isaac-launchable-1 (annotation present but pod not yet restarted): still alive, no manifest volume injected (intended; webhook v3 only injects on new pod creation).
  • usd-composer: 3/3 Running, no crash loop.
  • All other namespace pods steady.

Production state at end of verification

  • /usr/local/vgpu/libvgpu.so: md5 b23793609588d45510d908fa193bfc7b (HAMi-core w/ cherry-picked Tasks 1+2).
  • /usr/local/vgpu/libvgpu_vk.so: md5 58ebee3e61dc836d83142a61c51c8139 (Vulkan layer w/ EnumerateDevice* hooks).
  • /etc/vulkan/implicit_layer.d/hami.json: 577 bytes, INSTANCE / libvgpu_vk.so / enable_environment + disable_environment.
  • hami-vulkan-manifest-installer DaemonSet: 1/1 ready on ws-node074.
  • volcano-device-plugin DaemonSet: image vulkan-v2 (containing OLDER libvgpu_vk.so without Tasks 1+2). Follow-up: bump volcano-vgpu-device-plugin libvgpu submodule to current bdd5bbe and rebuild as vulkan-v3 so DS pod restart doesn't reset host .so to the older build.

Spec / Plan

  • Spec: docs/superpowers/specs/2026-04-29-step-d-vulkan-opt-in-production-activation.md
  • Plan: docs/superpowers/plans/2026-04-29-step-d-vulkan-opt-in-production-activation.md
  • Step C redesign spec/plan in same directory.

libvgpu vulkan-layer rebased onto Project-HAMi/HAMi-core main with
--signoff so the 2 revert commits (which git revert --no-edit created
without Signed-off-by) now carry DCO trailers, plus a small style fix
that wraps a HAMI_TRACE call to keep src/vulkan/layer.c under the
120-char cpplint limit.

PR Project-HAMi#182 CI: DCO + cpplint failures fixed; functional code unchanged
from the previously-verified bdd5bbe build (Step D 4-path PASS on
ws-node074).

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Resolves Dockerfile conflict by combining both intents:
- upstream Project-HAMi#1782: install git in nvbuild stage + apt cache cleanup
  (so libvgpu git describe works inside nvbuild)
- ours: keep libvulkan-dev (Vulkan layer build) and CI_COMMIT_SHA fallback

All other changes auto-merged cleanly. Build (go build ./...) passes.
TestCheckHealth/Kernel_6.17_Bug:_handshake_expired pre-exists on
origin/master 31a2792 — unrelated to this merge.
@100milliongold 100milliongold force-pushed the feat/vulkan-vgpu branch 2 times, most recently from 7b272c8 to 69daec9 Compare May 4, 2026 14:04
@100milliongold 100milliongold force-pushed the feat/vulkan-vgpu branch 2 times, most recently from 44fab32 to 5b0bd1b Compare May 4, 2026 14:58
…orkaround

Bumps libvgpu submodule from c3beead to e71fa3c, which adds:

  fix(vulkan): hook GetPhysicalDeviceMemoryProperties2KHR alias +
               inflate heapUsage on budget query

Carbonite/Kit and similar engines targeting Vulkan 1.0 with the KHR
extension form of properties2 query the layer GIPA for the
*KHR-suffixed* alias of vkGetPhysicalDeviceMemoryProperties2; without
the alias hook the call falls through to the ICD and our heap clamp
is bypassed (engines saw the host GPU's full ~45 GiB instead of the
partition limit).

Additionally, the VK_EXT_memory_budget extension's pNext-chained
budget struct is now adjusted via heapUsage inflation rather than
heapBudget clamping. Direct heapBudget clamping deadlocks
omni.physx.tensors during plugin init; inflating heapUsage instead
preserves the on-screen "available" math (heapBudget - heapUsage =
partition_limit - real_usage) without disturbing PhysX, which
consumes only heapBudget.

Verified end-to-end on isaac-launchable-0 (23 GiB partition,
ws-node074, NVIDIA 580.142, Isaac Sim Streaming 6.0):

  Pre-fix:  size=44.99 GiB heapBudget=32.12 GiB
            overlay "1.3 used / 32.2 available"   [false]
  Post-fix: size=23.00 GiB heapBudget=33.4  GiB heapUsage=11.5 GiB
            overlay "11.5 used / 21.9 available"  [21.9 = 23 - 1.1]
            omni.physx.tensors initializes cleanly; Kit reaches
            streaming-ready idle at 60 fps.
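
The heapUsage inflation can be written as a one-line identity: keep heapBudget as the driver reported it and choose heapUsage so that heapBudget - heapUsage equals partition_limit - real_usage. The Python sketch below reproduces the post-fix numbers under that reading; the clamping of the available amount to the range [0, heapBudget] is an added assumption, and the function name is illustrative.

```python
def inflate_heap_usage(heap_budget: float, real_usage: float,
                       partition_limit: float) -> float:
    """Return the heapUsage to report so 'available' matches the partition.

    All arguments share one unit (GiB in the example). heap_budget is left
    untouched because, per the commit message, clamping it breaks PhysX init.
    """
    available = partition_limit - real_usage
    # Assumption: never report negative availability or more than the budget.
    available = max(0.0, min(available, heap_budget))
    return heap_budget - available


if __name__ == "__main__":
    # Post-fix figures from the verification run: budget 33.4, real usage 1.1,
    # partition 23.0 (all GiB). Expected reported usage is about 11.5.
    usage = inflate_heap_usage(33.4, 1.1, 23.0)
    print(round(usage, 1), round(33.4 - usage, 1))
```

With these inputs the overlay math becomes 33.4 - 11.5 = 21.9 available, which matches 23 - 1.1.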

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Resolves conflict in libvgpu submodule by merging HAMi-core master
(CUDA 13 conditional + vllm eager OOM check, commit 8c32de6) into our
vulkan-layer branch — new submodule SHA carries the merge.

Auto-merged superproject files (no manual intervention):
- charts/hami/values.yaml
- docker/Dockerfile
- pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
- pkg/device/nvidia/device.go

Verified:
- go build ./... clean
- go test ./pkg/device/nvidia/... ./pkg/device-plugin/nvidiadevice/nvinternal/plugin/... pass
- mergeGraphicsCap + appendVulkanManifestMount paths intact

2 participants