feat(vulkan): Vulkan implicit layer to enforce per-pod GPU memory budget on vkAllocateMemory by 100milliongold · Pull Request #182 · Project-HAMi/HAMi-core

100milliongold · 2026-04-27T05:33:09Z

Summary

Adds a Vulkan implicit layer to HAMi-core that hooks vkAllocateMemory / vkFreeMemory and enforces the same per-pod memory budget HAMi-core already enforces for CUDA. This closes the gap where Vulkan workloads (Isaac Sim, ray tracing, GPU-accelerated rendering) bypass the limit because allocations don't go through the CUDA driver.

This is the companion to Project-HAMi/HAMi#1803, which wires up the manifest mount in the device-plugin and the env injection in the admission webhook. That PR cannot be merged upstream until this layer ships in HAMi-core, so I'm opening this first.

Why

HAMi-core currently enforces vGPU memory limits by intercepting CUDA driver calls (cuMemAlloc, cuMemAllocAsync, etc.). Vulkan applications allocate device memory through vkAllocateMemory, which goes directly to the NVIDIA Vulkan ICD and never touches HAMi-core's CUDA hook table. We hit this in production with Isaac Sim: Kit allocated several GB through Vulkan, ignored the requested partition, and OOM'd the host.

The Vulkan layer reuses HAMi-core's existing CUDA budget bookkeeping (CUDA_DEVICE_MEMORY_LIMIT_0 etc.) so a pod that asks for 4000 MiB gets the same enforcement on both APIs.

Design

The layer is a standard Vulkan implicit layer — Vulkan's loader picks it up automatically when the manifest is on the system.
The manifest is gated by enable_environment: HAMI_VULKAN_ENABLE=1, so the layer only activates when the pod opts in. CUDA-only workloads see no behavior change.
The layer hooks vkCreateInstance to chain the dispatch table, then vkAllocateMemory, vkFreeMemory, vkGetPhysicalDeviceMemoryProperties[2], and vkQueueSubmit[2]. Other entry points pass through unchanged.
Device index resolution uses NVML UUID lookup so the per-device budget matches what HAMi-core already tracks for CUDA.
A small "budget adapter" forwards Vulkan allocation accounting into HAMi-core's existing counter machinery — no parallel bookkeeping, no risk of CUDA and Vulkan budgets drifting apart.
A "throttle adapter" forwards vkQueueSubmit[2] calls through HAMi-core's rate_limiter so the existing core-utilization throttling also applies to Vulkan workloads.
Memory property queries clamp the device-local heap size to the per-pod budget so apps that pre-size their allocations from vkGetPhysicalDeviceMemoryProperties don't try to grab the full physical heap.
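
To make the memory side of the design concrete, here is a minimal sketch of the two ideas above: clamping device-local heaps to the budget, and reserving against that budget before an allocation is forwarded. It is not the code in src/vulkan/. The `sketch_*` types are stand-ins so the example compiles without Vulkan headers, and the budget struct is a hypothetical simplification of HAMi-core's counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the Vulkan memory-properties types (assumption: the real
 * layer works on VkPhysicalDeviceMemoryProperties and VkMemoryHeap). */
#define SKETCH_MAX_HEAPS 16
#define SKETCH_HEAP_DEVICE_LOCAL 0x1u /* mirrors VK_MEMORY_HEAP_DEVICE_LOCAL_BIT */

typedef struct { uint64_t size; uint32_t flags; } sketch_heap;
typedef struct {
    uint32_t heap_count;
    sketch_heap heaps[SKETCH_MAX_HEAPS];
} sketch_mem_props;

/* Hypothetical per-device budget: limit and bytes currently reserved. */
typedef struct { uint64_t limit; uint64_t used; } sketch_budget;

/* Shrink every device-local heap that is larger than the pod budget. */
static void sketch_clamp_heaps(sketch_mem_props *p, const sketch_budget *b) {
    assert(p != NULL && b != NULL);
    for (uint32_t i = 0; i < p->heap_count && i < SKETCH_MAX_HEAPS; i++) {
        if ((p->heaps[i].flags & SKETCH_HEAP_DEVICE_LOCAL) &&
            p->heaps[i].size > b->limit) {
            p->heaps[i].size = b->limit;
        }
    }
}

/* Returns 1 and records the reservation if it fits, 0 otherwise.
 * Written as a subtraction so a huge request cannot overflow the sum. */
static int sketch_budget_reserve(sketch_budget *b, uint64_t size) {
    assert(b != NULL && b->used <= b->limit);
    if (size > b->limit - b->used) return 0;
    b->used += size;
    return 1;
}

/* Give bytes back on free; saturates at zero for untracked frees. */
static void sketch_budget_release(sketch_budget *b, uint64_t size) {
    assert(b != NULL);
    b->used = (size > b->used) ? 0 : b->used - size;
}
```

In this shape, an allocation hook would call `sketch_budget_reserve` first, return an out-of-device-memory error when it yields 0, and call `sketch_budget_release` if the forwarded allocation itself fails.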

What changed

Area	Change
`src/vulkan/layer.{c,h}`	Layer entry point + dispatch table chaining (`vkCreateInstance` / `vkGetInstanceProcAddr` / `vkGetDeviceProcAddr`).
`src/vulkan/dispatch.{c,h}`	Per-instance dispatch table loader.
`src/vulkan/hooks_alloc.c`	`vkAllocateMemory` / `vkFreeMemory` hooks that consult HAMi-core's CUDA budget.
`src/vulkan/hooks_memory.c`	`vkGetPhysicalDeviceMemoryProperties[2]` clamping device-local heap to pod budget.
`src/vulkan/hooks_submit.c`	`vkQueueSubmit[2]` rate-limiting via `throttle_adapter`.
`src/vulkan/budget.h`	Adapter type forwarding allocation events into HAMi-core counters.
`src/vulkan/throttle_adapter.{c,h}`	Adapter forwarding queue submit throttling to `rate_limiter`.
`src/vulkan/physdev_index.{c,h}`	Resolve `VkPhysicalDevice` → device index via NVML UUID.
`src/vulkan/hami_implicit_layer.json`	Vulkan implicit-layer manifest, gated by `enable_environment: HAMI_VULKAN_ENABLE=1`.
`CMakeLists.txt` (+ install rules)	Build `vulkan_mod` as an OBJECT library linked into `libvgpu.so`; install the manifest. Library symbol surface unchanged for CUDA-only consumers.
`test/vulkan/*.c`	Unit tests: layer init, allocation budget, memprops clamp, queue submit throttle, throttle adapter forwarding.
`test/CMakeLists.txt`	Vulkan tests built as a separate target so existing CUDA test glob is unchanged.
`src/cuda/...` (2 fixes)	`cuMemFree[Async]` falls back to the real driver when called with an untracked pointer (some Vulkan ICD callbacks free pointers we never tracked). `cuMemGetInfo_v2` guards against NULL out params (OptiX crash repro).

Build

vulkan_mod is built as an OBJECT library and linked into libvgpu.so, so:

The shared library symbol surface for existing CUDA-only consumers is unchanged.
The Vulkan layer ships inside the same .so, so HAMi only needs to mount one library.
The manifest installs to ${CMAKE_INSTALL_PREFIX}/etc/vulkan/implicit_layer.d/hami.json.

Vulkan headers >= 1.3.280 are required for vkQueueSubmit2. A fallback for the VK_LAYER_EXPORT macro covers older headers.

Test plan

make test — unit tests under test/vulkan/ pass on a host without an NVIDIA GPU (mocked dispatch).
Built as part of HAMi's make docker (with libvulkan-dev installed).
E2E: deployed inside a HAMi cluster (with the companion HAMi PR), ran an Isaac Sim pod with nvidia.com/gpumem: 4000 + hami.io/vulkan: "true". Kit boot log reports GPU Memory: 4000 MB and the workload is held to it.
Regression: same image without the annotation runs CUDA workloads bit-identical to the previous release.

Compatibility

CUDA-only workloads: zero behavior change. The Vulkan layer is gated by both the manifest's enable_environment and HAMi-core's existing per-pod budget — neither activates without the pod opting in.
Vulkan workloads without the annotation: no Vulkan layer loaded, no env, no behavior change.
Vulkan workloads with the annotation: per-pod memory budget enforced on vkAllocateMemory, queue submit throttled like CUDA kernel launches.

Notes for reviewers

This branch uses Vulkan-Headers 1.3.280+ to access vkQueueSubmit2. If the project prefers an older minimum, the relevant code path is already guarded by VK_VERSION_1_3 and can be made strictly conditional.
The two src/cuda/* fixes (cuMemFree[Async] untracked-pointer fallback, cuMemGetInfo_v2 NULL guard) are small and stand on their own — happy to split them into a separate PR if it makes review easier.
I can split this PR into smaller pieces (skeleton + dispatch / alloc hooks / submit throttle / memprops clamp / build integration / tests) if that's easier to review.
The companion HAMi PR is Project-HAMi/HAMi#1803. It currently points the libvgpu submodule at xiilab/HAMi-core@vulkan-layer; once this PR is merged I'll update it to point at the upstream commit.

hami-robot · 2026-04-27T05:33:15Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: 100milliongold
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hami-robot · 2026-04-27T05:33:19Z

Welcome @100milliongold! It looks like this is your first PR to Project-HAMi/HAMi-core 🎉

100milliongold · 2026-04-27T15:45:45Z

Honest status update — Vulkan hook chain not fully wired in production

I want to disclose a partial regression I found while doing end-to-end validation on a real cluster (RTX 6000 Ada × 2, NVIDIA driver 580.142, K8s 1.34, isaac-launchable production workload). I would love guidance on the right fix.

What works

webhook env injection (HAMI_VULKAN_ENABLE=1, graphics capability merge) — verified via container env.
volcano-vgpu-device-plugin Vulkan manifest auto-mount at /etc/vulkan/implicit_layer.d/hami.json — verified.
The implicit-layer manifest is parsed by the Vulkan loader; Loading layer library /usr/local/vgpu/libvgpu.so, Insert instance layer "VK_LAYER_HAMI_vgpu", vkCreateInstance layer callstack setup to: VK_LAYER_HAMI_vgpu, and the corresponding device-side line all show up under VK_LOADER_DEBUG=all.
After commit 30ddb01 the loader entry points (vkNegotiateLoaderLayerInterfaceVersion, vkGetInstanceProcAddr, vkGetDeviceProcAddr) are exported with default visibility (an earlier build had -fvisibility=hidden and only hami_*-prefixed symbols, so the loader could not resolve the layer at all — that part is now fixed).
CUDA-path enforcement is intact: PyTorch ≥ 25 GiB allocations on a 23 GiB-budgeted pod return CUDA out of memory from HAMi-core's cuMemAlloc hook.

What does NOT work (yet)

A small Vulkan-only test that talks to libvulkan.so.1 directly (ctypes, minimal vkCreateInstance → vkGetPhysicalDeviceMemoryProperties → vkCreateDevice → vkAllocateMemory(25 GiB)) shows that the layer's hook functions are not actually called:

heap[0] size = 44.99 GiB device_local=True   ← native, not 23 GiB; vkGetPhysicalDeviceMemoryProperties hook bypassed
=== vkAllocateMemory tests ===
  20 GiB: SUCCESS
  22 GiB: SUCCESS
  25 GiB: SUCCESS                            ← over the 23 GiB budget
  30 GiB: SUCCESS                            ← also over

So the loader sees the layer and inserts it into the call stack, but at dispatch time the calls go straight to the ICD — our hami_vkGetPhysicalDeviceMemoryProperties and hami_vkAllocateMemory are never invoked. The same is presumably true for hami_vkQueueSubmit (which is why I cannot yet show queue-throttle evidence either).

Where I am stuck

The most likely failure modes I am tracking:

1. hami_vkCreateInstance not actually entered. If our hami_vkCreateInstance is bypassed by the loader, hami_instance_register() never runs, so the instance dispatch table our hooks rely on is empty, and hami_vkGetInstanceProcAddr falls back through if (!d) return NULL; for any name that isn't in the explicit HAMI_HOOK(...) list. I have not yet been able to confirm whether hami_vkCreateInstance is being called.
2. Wrong layer interface version with manifest 1.2.0. Our manifest is file_format_version: "1.2.0" and we export vkNegotiateLoaderLayerInterfaceVersion, so the loader should pick interface v2. The loader debug output says using deprecated 'vkGetInstanceProcAddr' tag when I add a functions block to the manifest, and shows neither Enable Env Var nor any negotiate trace when I remove that block, so I cannot tell from the log alone which path actually wins.
3. Layer hook is built but not chain-linked. nm -D libvgpu.so shows all hami_* and the canonical vk* entry points exported. Yet we still see no hook activity at runtime. This makes me suspect a chain-info handling bug in src/vulkan/layer.c::hami_vkCreateInstance (e.g. wrong VK_LAYER_LINK_INFO walk, or chain->u.pLayerInfo = chain->u.pLayerInfo->pNext advancing into NULL).

I plan to add fprintf(stderr, "HAMI: hami_vkCreateInstance entered\n") style instrumentation, rebuild, and rerun the ctypes test to localize which of (1)–(3) is actually true. But before I send the next round, I would really value your read on the chain handling in src/vulkan/layer.c — if you spot something obviously wrong from a quick skim it would save a debug cycle.
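
For comparison while reading layer.c, this is a sketch of the conventional chain walk a layer does in its vkCreateInstance: find the loader's link-info element in the pNext chain, take the next layer's vkGetInstanceProcAddr, then advance the link for the layer below. It is not the code in this PR. The `sketch_*` types are stand-ins for VkLayerInstanceCreateInfo and VkLayerInstanceLink so the example compiles without Vulkan headers, and the sketch assumes every chain element shares this layout.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*sketch_fn)(void);
typedef sketch_fn (*sketch_gipa)(void *instance, const char *name);

/* Stand-in for VkLayerInstanceLink: one entry per layer below us. */
typedef struct sketch_link {
    struct sketch_link *pNext;
    sketch_gipa pfnNextGetInstanceProcAddr;
} sketch_link;

/* Values mirror VK_STRUCTURE_TYPE_LOADER_INSTANCE_CREATE_INFO and
 * the VkLayerFunction enum from vk_layer.h. */
enum { SKETCH_STYPE_LOADER_INSTANCE_CREATE_INFO = 47 };
enum { SKETCH_LAYER_LINK_INFO = 0, SKETCH_LOADER_DATA_CALLBACK = 1 };

/* Stand-in for VkLayerInstanceCreateInfo. */
typedef struct sketch_layer_create_info {
    int sType;
    const void *pNext;
    int function;
    union { sketch_link *pLayerInfo; } u;
} sketch_layer_create_info;

/* Placeholder for the next layer's (or the ICD's) vkGetInstanceProcAddr. */
static sketch_fn sketch_dummy_gipa(void *instance, const char *name) {
    (void)instance;
    (void)name;
    return NULL;
}

/* Walk the pNext chain, return the next GIPA, and advance the link so the
 * layer below sees its own entry. Returns NULL if no link info is found. */
static sketch_gipa sketch_take_next_gipa(const void *create_info_pnext) {
    sketch_layer_create_info *chain =
        (sketch_layer_create_info *)create_info_pnext;
    while (chain != NULL &&
           !(chain->sType == SKETCH_STYPE_LOADER_INSTANCE_CREATE_INFO &&
             chain->function == SKETCH_LAYER_LINK_INFO)) {
        chain = (sketch_layer_create_info *)chain->pNext;
    }
    if (chain == NULL || chain->u.pLayerInfo == NULL) return NULL;

    /* Read the function pointer BEFORE advancing; reading it afterwards
     * would pick up the wrong layer or dereference NULL. */
    sketch_gipa next = chain->u.pLayerInfo->pfnNextGetInstanceProcAddr;
    chain->u.pLayerInfo = chain->u.pLayerInfo->pNext;
    return next;
}
```

The two checks worth making against the real code are that the walk matches on both sType and function (the chain also carries a loader-data-callback element with the same sType), and that the next pointer is captured before the advance.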

Disclosure

Per the AI Assistance Notice — Claude (Anthropic) was used as an editing/exploration assistant for this work, but every decision, build, and validation step was run and verified by me on the actual cluster. The failure mode above is a real one I observed in production; I am bringing it to you because the PR is not yet shippable and I would rather acknowledge that openly than push a half-working layer.

Companion PR status

HAMi #1803 (webhook + device-plugin manifest mount): unaffected (the manifest-side stack does its job; the issue is purely inside HAMi-core's Vulkan layer dispatch).
volcano-vgpu-device-plugin "Fix cuMallocAsync not released properly" #118 / "Is there a way to limit which GPUs nvidia-smi shows in a privileged container?" #119: unaffected for the same reason.

Thanks for taking another look. /cc @archlitchi @wawa0210 @FouoF

100milliongold · 2026-04-27T16:17:59Z

Update — partition enforcement now verified end-to-end

Following up on my earlier disclosure, I traced the root cause and pushed a fix in 93dd103.

Root cause

The Vulkan loader chain, layer entry points, and vkAllocateMemory hook were all wired correctly — that part of this PR works. The break was in the physical-device → NVML index resolver:

HAMI_VK_TRACE: vk_uuid (target) = 00000000-0000-0000-0000-000000000000
HAMI_VK_TRACE: nvml[0] uuid_str='GPU-07ac64f7-ff7f-d9fb-9bcb-7afb7720386f'
HAMI_VK_TRACE: nvml_index_for_uuid: NO MATCH -> -1
HAMI_VK_TRACE: clamp_heaps physdev_index -> dev=-1
HAMI_VK_TRACE: clamp_heaps EARLY RETURN (dev<0)
HAMI_VK_TRACE: hami_vkAllocateMemory: idx<0 -> SKIP budget enforcement

VkPhysicalDeviceIDProperties.deviceUUID returned by NVIDIA driver 580.142 inside the container was 16 bytes of zero, even though vkGetPhysicalDeviceProperties2 returns successfully and VK_KHR_external_memory_capabilities is exposed. NVML's nvmlDeviceGetUUID returned the correct GPU-… value, so strict byte-compare matching failed and the resolver returned -1. With dev=-1, clamp_heaps early-returns and the alloc hook skips hami_budget_reserve — explaining the 44.99 GiB heap and unbounded allocations I saw before.

I haven't pinned down whether this is a driver bug, a missing extension on the loader path, or something specific to the Vulkan loader → ICD shim — that's worth a separate report — but the practical impact is clear.

Fix

Single-GPU heuristic in src/vulkan/physdev_index.c::nvml_index_for_uuid:

if (is_zero_uuid(vk_uuid) && count == 1) {
    HAMI_TRACE("nvml_index_for_uuid: vk_uuid all-zero + NVML count==1 "
               "-> single-GPU fallback idx=0");
    return 0;
}

This is safe because the HAMi operating model assigns one GPU per container via the device-plugin, so when NVML reports exactly one device, index 0 is the only candidate. Multi-GPU containers (NVIDIA_VISIBLE_DEVICES enumerating 2+ GPUs) fall through to the strict matcher to avoid mis-binding.
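
A self-contained sketch of how the heuristic sits in front of the strict matcher. The `sketch_*` names are hypothetical and this is not the body of physdev_index.c. In particular, it assumes the strict match formats the 16 Vulkan UUID bytes as an NVML-style "GPU-…" string and compares strings; the real resolver may compare differently.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_UUID_BYTES 16
#define SKETCH_UUID_STRLEN 41 /* "GPU-" + 36 characters + NUL */

static int sketch_is_zero_uuid(const uint8_t uuid[SKETCH_UUID_BYTES]) {
    for (int i = 0; i < SKETCH_UUID_BYTES; i++) {
        if (uuid[i] != 0) return 0;
    }
    return 1;
}

/* Render 16 raw bytes in the 8-4-4-4-12 layout NVML prints. */
static void sketch_format_uuid(const uint8_t u[SKETCH_UUID_BYTES],
                               char out[SKETCH_UUID_STRLEN]) {
    snprintf(out, SKETCH_UUID_STRLEN,
             "GPU-%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
             "%02x%02x%02x%02x%02x%02x",
             u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
             u[8], u[9], u[10], u[11], u[12], u[13], u[14], u[15]);
}

/* Returns the matching index in nvml_uuids, or -1 when nothing matches. */
static int sketch_index_for_uuid(const uint8_t vk_uuid[SKETCH_UUID_BYTES],
                                 const char *const nvml_uuids[], int count) {
    assert(vk_uuid != NULL);

    /* Single-GPU fallback: an all-zero Vulkan UUID cannot be matched,
     * but with exactly one visible device there is only one answer. */
    if (sketch_is_zero_uuid(vk_uuid) && count == 1) return 0;

    /* Strict path: multi-GPU containers must match exactly. */
    char want[SKETCH_UUID_STRLEN];
    sketch_format_uuid(vk_uuid, want);
    for (int i = 0; i < count; i++) {
        if (nvml_uuids[i] != NULL && strcmp(nvml_uuids[i], want) == 0) return i;
    }
    return -1;
}
```

With an all-zero UUID and two devices the function still returns -1, which keeps budget enforcement off rather than binding to the wrong GPU.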

Verification

ws-node074 / NVIDIA RTX 6000 Ada / driver 580.142 / k0s v1.34.3 / Volcano-scheduled pod with volcano.sh/vgpu-memory=23552Mi:

HAMI_VK_TRACE: nvml_index_for_uuid: vk_uuid all-zero + NVML count==1 -> single-GPU fallback idx=0
HAMI_VK_TRACE: physdev_index physDev=0x... -> idx=0
HAMI_VK_TRACE: budget_of dev=0 -> limit=24696061952
HAMI_VK_TRACE: clamp_heaps[0] 48305799168 -> 24696061952
HAMI_VK_TRACE: budget_reserve oom_check dev=0 size=26843545600 -> 1
HAMI_VK_TRACE: hami_vkAllocateMemory: budget reserve REJECTED idx=0 size=26843545600

heap[0] size=23.00 GiB device_local=True
  20 GiB: SUCCESS
  22 GiB: SUCCESS
  25 GiB: VK_ERROR_OUT_OF_DEVICE_MEMORY  <-- partition enforce
  30 GiB: VK_ERROR_OUT_OF_DEVICE_MEMORY  <-- partition enforce

The partition is now enforced end-to-end through the Vulkan path — the heap reported by vkGetPhysicalDeviceMemoryProperties is clamped from 44.99 GiB to 23.00 GiB and vkAllocateMemory correctly returns VK_ERROR_OUT_OF_DEVICE_MEMORY once the budget is exceeded.
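
The byte values in the trace can be cross-checked against the MiB and GiB figures with a few lines of arithmetic. The helper names below are mine and only the numbers come from the trace above.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t sketch_mib_to_bytes(uint64_t mib) {
    return mib * 1024ULL * 1024ULL;
}

static double sketch_bytes_to_gib(uint64_t bytes) {
    return (double)bytes / (1024.0 * 1024.0 * 1024.0);
}

/* Checks that the trace values are mutually consistent. */
static void sketch_check_trace_values(void) {
    const uint64_t limit = 24696061952ULL;    /* budget_of dev=0 */
    const uint64_t physical = 48305799168ULL; /* heap before the clamp */
    const uint64_t request = 26843545600ULL;  /* rejected allocation */

    assert(sketch_mib_to_bytes(23552) == limit);   /* 23552 MiB budget */
    assert(sketch_bytes_to_gib(limit) == 23.0);    /* reported as 23.00 GiB */
    assert(sketch_bytes_to_gib(physical) > 44.98 &&
           sketch_bytes_to_gib(physical) < 44.99); /* rounds to 44.99 GiB */
    assert(request == (25ULL << 30));              /* the 25 GiB test case */
    assert(request > limit);                       /* hence the rejection */
    assert((22ULL << 30) <= limit);                /* 22 GiB still fits */
}
```

So the 23552Mi request maps exactly to the 24696061952-byte limit, and the 25 GiB allocation exceeds it by 2 GiB.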

Also bundled in this commit

Opt-in HAMI_VK_TRACE=1 instrumentation across budget.c, physdev_index.c, hooks_memory.c::clamp_heaps, and hooks_alloc.c::hami_vkAllocateMemory. Zero-cost when the env var is unset (cached static int check). Useful for future diagnosis of similar driver/UUID issues — happy to factor it out into a separate commit if reviewers prefer.
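
A minimal sketch of the cached env-var gate described above, assuming the check is a function-local static that is filled on first use. The `sketch_*` names are hypothetical and the actual HAMI_TRACE macro may be structured differently.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads HAMI_VK_TRACE once and caches the answer. After the first call the
 * cost is one integer compare. The unsynchronized first write is a benign
 * race here because every thread would store the same value. */
static int sketch_trace_enabled(void) {
    static int cached = -1; /* -1 means "not checked yet" */
    if (cached < 0) {
        const char *v = getenv("HAMI_VK_TRACE");
        cached = (v != NULL && v[0] == '1' && v[1] == '\0');
    }
    return cached;
}

/* printf-style trace line on stderr; a no-op when tracing is off. */
static void sketch_trace(const char *fmt, ...) {
    if (!sketch_trace_enabled()) return;
    va_list ap;
    va_start(ap, fmt);
    fputs("HAMI_VK_TRACE: ", stderr);
    vfprintf(stderr, fmt, ap);
    fputc('\n', stderr);
    va_end(ap);
}
```

Because the value is cached, changing the variable after the first trace call has no effect, which is the usual trade-off for keeping the hot path cheap.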

100milliongold · 2026-04-28T06:12:31Z

Pushed 91ca00c adding the missing build switch for the NVML dlsym redirect.

Background — symptom

On a Vulkan-vGPU pod (RTX 6000 Ada, partition 23 GiB) running the Isaac Sim 6.0.0-rc.22 streaming kit, runheadless.sh reproducibly segfaulted around 8.7s into init, right after omni.kit.livestream.app / isaacsim.exp.full.streaming startup. The exit code was 139 with no useful backtrace (the minidump was 0 bytes). The crash reproduced across a pod restart, a vscode container restart, and a second pod on the same node: same variables, same result.

Root cause

The hook for nvmlDeviceGetMemoryInfo / _v2 already exists in src/nvml/hook.c and the dispatcher in src/libvgpu.c:126 sends nvml-prefixed dlsym lookups into __dlsym_hook_section_nvml. But that dispatcher block is wrapped in #ifdef HOOK_NVML_ENABLE, and although build.sh passes -DHOOK_NVML_ENABLE=1, that's a CMake variable — not a compiler -D. Nothing in the CMake files translates it into target_compile_definitions, so the ifdef compiles out and dlsym(handle, "nvmlDeviceGetMemoryInfo_v2") falls through to the real libnvidia-ml.so symbol.

End result: Vulkan and CUDA paths report the partitioned 23552 MiB heap (correct, this PR's existing behavior), but NVML reports the raw 46068 MiB. Isaac Sim's streaming kit consults NVML during init to plan framebuffer / encoder allocations and SegFaults when those plans collide with the actual partition.
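
The effect of the missing define can be shown in isolation. This is a simplified sketch, not the dispatcher in src/libvgpu.c: the enum and function are hypothetical, and the macro is defined in the file only so the hooked path is the one that compiles.

```c
#include <assert.h>
#include <string.h>

/* In the real build this must arrive as a compiler definition
 * (-DHOOK_NVML_ENABLE on the compile line). A CMake cache variable of the
 * same name never reaches the preprocessor. Remove this line to see the
 * fall-through behaviour described above. */
#define HOOK_NVML_ENABLE 1

typedef enum {
    SKETCH_FALL_THROUGH = 0, /* lookup goes to the real library */
    SKETCH_NVML_HOOK = 1     /* lookup is redirected to the hook section */
} sketch_target;

/* Decide where a dlsym() lookup for `symbol` should be routed. */
static sketch_target sketch_resolve(const char *symbol) {
    assert(symbol != NULL);
#ifdef HOOK_NVML_ENABLE
    if (strncmp(symbol, "nvml", 4) == 0) {
        return SKETCH_NVML_HOOK;
    }
#endif
    return SKETCH_FALL_THROUGH;
}
```

Without the define, the whole `if` disappears at preprocessing time, so the hook code still links into the library but no lookup can ever reach it.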

Verification — `readelf` on the freshly built `.so`

Before this commit: readelf --dyn-syms libvgpu.so | grep nvml was empty for the memory-info hooks (they weren't being wired into the dispatcher). After: the same dispatcher routes into _nvmlDeviceGetMemoryInfo, and the LOG_DEBUG "into nvmlDeviceGetMemoryInfo" trace fires under LIBCUDA_LOG_LEVEL=4.

Verification — production pod on `ws-node074`

Built the new .so on the node, atomic-swapped /usr/local/vgpu/libvgpu.so, recreated the pod so device-plugin re-mounts the new file, then:

# inside the vscode container, container md5 matches the new build
$ md5sum /usr/local/vgpu/libvgpu.so
6291473077c45bf912f296ef1a4367b9  /usr/local/vgpu/libvgpu.so

# nvidia-smi now reports the partition value
$ nvidia-smi --query-gpu=memory.total,memory.free --format=csv
memory.total [MiB], memory.free [MiB]
23552 MiB, 23552 MiB         # was: 46068 MiB, 45458 MiB

# runheadless.sh no longer dies during streaming init
$ timeout 35 env ACCEPT_EULA=y /isaac-sim/runheadless.sh
exit=124   # was: exit=139 SegFault @ 8.7s
# last line: [5.010s] [ext: omni.physx.tensors-110.0.7] startup
# crash count in log: 0   # was: 2

Repeated on a second pod (isaac-launchable-1), same result: 23552 MiB and clean init.

The fix is a 6-line CMake change. Diff for reference:

+# Activate NVML dlsym redirect (libvgpu.c:#ifdef HOOK_NVML_ENABLE).
+# Without this define the dispatcher in dlsym() falls through to the real
+# libnvidia-ml so consumers like nvidia-smi / Isaac Sim Kit see the raw
+# 46 GiB heap instead of the partitioned limit, which is inconsistent with
+# the Vulkan/CUDA paths and trips Kit asserts during streaming init.
+target_compile_definitions(${LIBVGPU} PUBLIC HOOK_NVML_ENABLE)

Existing code (_nvmlDeviceGetMemoryInfo, the v1/v2 wrappers, the dlsym table entries DLSYM_HOOK_FUNC(nvmlDeviceGetMemoryInfo) / ..._v2) is untouched: it was already correct, just unreachable. The same likely applies to the other -D…_ENABLE=1 flags in build.sh (MULTIPROCESS_LIMIT_ENABLE, HOOK_MEMINFO_ENABLE, DLSYM_HOOK_ENABLE). That is out of scope for this PR but worth a follow-up audit.

100milliongold · 2026-04-28T11:02:23Z

Step B complete — CUDA hook NULL guard hardening

Adds NULL pointer guards to CUDA hooks following the pattern from cuMemGetInfo_v2 (commit 03f99d7):

Hook	Commit	Change
cuMemAlloc_v2	`88143ab`	NULL dptr forwards to driver
cuMemAllocManaged	`275ba3d`	NULL dptr forwards before oom_check
cuMemAllocPitch_v2	`01a58f1`	NULL dptr/pPitch forwards before oom_check
cuMemHostRegister_v2	`7dcb5a4`	drop vestigial cuCtxGetDevice + NULL hptr guard
cuMemHostAlloc	(tests-only `7dcb5a4`)	already forward-first; tests added
cuCtxGetDevice	(tests-only `7dcb5a4`)	already passthrough; tests added
audit notes	`7b76d9b`	docs/notes

Verification

test/test_cuda_null_guards.c: 9 unit tests, all pass under LD_PRELOAD=libvgpu.so on ws-node074. In a manual NULL stress test inside the isaac-launchable pod (4 hooks × NULL args), every call returns a non-zero error code with no SegFault. The isaac-launchable namespace baseline (5/5 runheadless.sh alive) is preserved.

Why

NVIDIA OptiX denoising / Aftermath / Carbonite tasking call HAMi-core hooks with NULL args in fallback probes during Isaac Sim Kit init. Without the guards, libvgpu.so dereferences NULL and SegFaults. The pattern mirrors the existing fix in commit 03f99d7 (cuMemGetInfo_v2). Step C (Vulkan layer compat hardening) follows in a separate plan.
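
The guard pattern, sketched with stand-in types so it compiles without CUDA headers. `sketch_real_mem_alloc` is a fake driver and the limit counters are hypothetical; the point is only the ordering, where a NULL out-pointer is forwarded before any bookkeeping runs, as the table describes for cuMemAlloc_v2.

```c
#include <stddef.h>
#include <stdint.h>

typedef int sketch_result;      /* stand-in for CUresult */
typedef uint64_t sketch_devptr; /* stand-in for CUdeviceptr */

/* Values mirror CUDA_SUCCESS, CUDA_ERROR_INVALID_VALUE and
 * CUDA_ERROR_OUT_OF_MEMORY. */
enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERROR_INVALID_VALUE = 1,
    SKETCH_ERROR_OUT_OF_MEMORY = 2
};

static uint64_t sketch_limit = 1024; /* hypothetical budget in bytes */
static uint64_t sketch_used = 0;

/* Fake driver entry point: rejects NULL the way a driver would. */
static sketch_result sketch_real_mem_alloc(sketch_devptr *dptr, size_t bytes) {
    (void)bytes;
    if (dptr == NULL) return SKETCH_ERROR_INVALID_VALUE;
    *dptr = 0x1000; /* pretend device address */
    return SKETCH_SUCCESS;
}

/* Hook: forward NULL probes untouched, otherwise enforce the budget. */
static sketch_result sketch_hook_mem_alloc(sketch_devptr *dptr, size_t bytes) {
    if (dptr == NULL) {
        /* Let the driver produce its own error code; never dereference
         * the pointer or touch the counters on this path. */
        return sketch_real_mem_alloc(dptr, bytes);
    }
    if (bytes > sketch_limit - sketch_used) {
        return SKETCH_ERROR_OUT_OF_MEMORY;
    }
    sketch_result r = sketch_real_mem_alloc(dptr, bytes);
    if (r == SKETCH_SUCCESS) sketch_used += bytes;
    return r;
}
```

Forwarding instead of returning a hook-chosen error keeps the caller's fallback probe seeing exactly what it would see without libvgpu.so loaded.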

100milliongold · 2026-04-29T01:29:21Z

Step C redesigned — Vulkan layer split into libvgpu_vk.so

The 2026-04-28 attempt (commits 996cb22 cache+Enumerate hooks, eea2beb GIPA fallback — both reverted in this push) regressed runheadless.sh under LD_PRELOAD on ws-node074. Trace evidence in docs/superpowers/notes/2026-04-28-vk-trace-isaac-sim.md proved our layer wrappers were never called; the regression lived at the .so-load boundary. Rather than spending more diagnostic cycles on production hardware, this redesign makes that class of regression structurally impossible.

Commits (this push)

Commit	Change
`f52aada`	Revert: fix(vulkan): GIPA/GDPA fallback to cached next when instance/device unknown
`83fd245`	Revert: fix(vulkan): cache first next-gipa/gdpa + EnumerateDevice* via dispatch table
`1118553`	feat(hami-core): explicit `hami_core_*` export wrappers
`e5812e6`	refactor(vulkan): use `hami_core_*` wrappers instead of internal externs
`b24f71c`	build: split Vulkan layer into separate `libvgpu_vk.so`
`65930f4`	feat(vulkan): ship `hami.json` implicit-layer manifest

(Plus the docs commits already on the branch documenting the trace evidence + reverts.)

What changed

libvgpu.so keeps NVML/CUDA hooks + allocator + multiprocess. Loses all vk* exports.
New libvgpu_vk.so carries the entire src/vulkan/* and exports the Vulkan layer entry points (vkGetInstanceProcAddr, vkGetDeviceProcAddr, vkNegotiateLoaderLayerInterfaceVersion). DT_NEEDED includes libvgpu.so, so the dynamic linker resolves the 5 hami_core_* wrappers when the Vulkan loader dlopen()s the layer.
share/hami/hami.json is the implicit-layer manifest the Step D webhook drops into /etc/vulkan/implicit_layer.d/.

Verification on ws-node074 (isaac-launchable-0)

Check	Result
`nm -D libvgpu.so | grep ' T vk'`	0 lines ✓
`nm -D libvgpu.so | grep ' T hami_core_'`	5 lines ✓
`readelf -d libvgpu_vk.so | grep NEEDED`	includes `libvgpu.so` ✓
`nm -D --undefined-only libvgpu_vk.so | grep hami_core_`	5 lines ✓
Step B regression `test_cuda_null_guards`	9/9 [OK] EXIT=0 ✓
`test_alloc` under LD_PRELOAD	EXIT=0 ✓
LD_PRELOAD `libvgpu.so` w/o manifest, `runheadless.sh` × 5	5/5 `exit=124 crash=0 listen=1` ✓
LD_PRELOAD `libvgpu.so` + manifest in pod, `runheadless.sh` × 5	5/5 alive, manifest enumerated, NVML clamp 23552 MiB ✓
`HAMI_VK_TRACE` lines on manifest path	0 — Kit's embedded Conan vulkan-loader path didn't traverse our GIPA in this run; partition Vulkan-side enforcement to be confirmed in Step D's 4-path verification

The first row above is the headline: the 2026-04-28 regression class is structurally gone. Production /usr/local/vgpu/libvgpu.so was restored to the pre-Step-C backup (md5 8f889313) after verification.