feat(vulkan): Vulkan implicit layer to enforce per-pod GPU memory budget on vkAllocateMemory #182
100milliongold wants to merge 44 commits into
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: 100milliongold The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Welcome @100milliongold! It looks like this is your first PR to Project-HAMi/HAMi-core 🎉 |
Force-pushed from af64c20 to 3bebc8a.
Honest status update: Vulkan hook chain not fully wired in production
I want to disclose a partial regression I found while doing end-to-end validation on a real cluster (RTX 6000 Ada × 2, NVIDIA driver 580.142, K8s 1.34, isaac-launchable production workload). I would love guidance on the right fix.
What works
What does NOT work (yet)
A small Vulkan-only test that talks to So the loader sees the layer and inserts it into the call stack, but at dispatch time the calls go straight to the ICD; our
Where I am stuck
The most likely failure modes I am tracking:
I plan to add
Disclosure
Per the AI Assistance Notice: Claude (Anthropic) was used as an editing/exploration assistant for this work, but every decision, build, and validation step was run and verified by me on the actual cluster. The failure mode above is a real one I observed in production; I am bringing it to you because the PR is not yet shippable and I would rather acknowledge that openly than push a half-working layer.
Companion PR status
Thanks for taking another look. /cc @archlitchi @wawa0210 @FouoF
Update: partition enforcement now verified end-to-end
Following up on my earlier disclosure, I traced the root cause and pushed a fix in 93dd103.
Root cause
The Vulkan loader chain, layer entry points, and
I haven't pinned down whether this is a driver bug, a missing extension on the loader path, or something specific to the Vulkan loader → ICD shim (that's worth a separate report), but the practical impact is clear.
Fix
Single-GPU heuristic in
    if (is_zero_uuid(vk_uuid) && count == 1) {
        HAMI_TRACE("nvml_index_for_uuid: vk_uuid all-zero + NVML count==1 "
                   "-> single-GPU fallback idx=0");
        return 0;
    }
Safe because the HAMi operating model assigns one GPU per container via the device-plugin. Multi-GPU containers (
Verification
ws-node074 / NVIDIA RTX 6000 Ada / driver 580.142 / k0s v1.34.3 / Volcano-scheduled pod with
The partition is now enforced end-to-end through the Vulkan path: the heap reported by
Also bundled in this commit
Opt-in
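The single-GPU fallback quoted in the Fix above can be sketched in isolation as follows. This is a minimal sketch, not the code from 93dd103: the body of `is_zero_uuid` is an assumption about what the helper does, and `single_gpu_fallback_index` and `SK_UUID_SIZE` are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_UUID_SIZE 16  /* same length as VK_UUID_SIZE */

/* Assumed behavior of is_zero_uuid: true when all 16 bytes are zero. */
static bool is_zero_uuid(const uint8_t uuid[SK_UUID_SIZE]) {
    for (size_t i = 0; i < SK_UUID_SIZE; ++i) {
        if (uuid[i] != 0) return false;
    }
    return true;
}

/* Hypothetical helper: returns 0 (the only NVML index) when the Vulkan UUID
 * is all-zero and NVML sees exactly one device, otherwise -1 so the caller
 * falls back to its normal "unresolved" handling. */
static int single_gpu_fallback_index(const uint8_t vk_uuid[SK_UUID_SIZE],
                                     unsigned int nvml_count) {
    if (is_zero_uuid(vk_uuid) && nvml_count == 1) return 0;
    return -1;
}
```

The fallback is deliberately narrow: with two or more NVML devices an all-zero UUID stays unresolved, since guessing an index there could charge the wrong budget.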
Pushed
Background: symptom
On a Vulkan-vGPU pod (RTX 6000 Ada, partition 23 GiB) running Isaac Sim 6.0.0-rc.22 streaming kit,
Root cause
The hook for
End result: Vulkan and CUDA paths report the partitioned 23552 MiB heap (correct, this PR's existing behavior), but NVML reports the raw 46068 MiB. Isaac Sim's streaming kit consults NVML during init to plan framebuffer / encoder allocations and SegFaults when those plans collide with the actual partition.
Verification
Step B complete: CUDA hook NULL guard hardening
Adds NULL pointer guards to CUDA hooks following the pattern from
Verification
Why
NVIDIA OptiX denoising / Aftermath / Carbonite tasking call HAMi-core hooks during Isaac Sim Kit init with NULL args during fallback probes. Without the guards, libvgpu.so dereferences NULL and SegFaults. Pattern mirrors the existing fix in commit 03f99d7 (cuMemGetInfo_v2). Step C (Vulkan layer compat hardening) follows in a separate plan.
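The forward-first NULL guard described above can be illustrated with a small mock. This is a sketch under stated assumptions, not the HAMi-core source: `real_mem_alloc`, `over_budget`, and `hooked_mem_alloc` are hypothetical stand-ins for the real driver entry point, the budget check, and the hook, and the `sk_*` types replace `cuda.h` so the block builds anywhere.

```c
#include <stddef.h>

/* Stand-ins so the sketch builds without cuda.h; the numeric values match
 * CUDA_SUCCESS, CUDA_ERROR_INVALID_VALUE and CUDA_ERROR_OUT_OF_MEMORY. */
typedef int sk_result;
typedef unsigned long long sk_devptr;
enum { SK_OK = 0, SK_INVALID_VALUE = 1, SK_OUT_OF_MEMORY = 2 };

static int g_budget_checks = 0;  /* counts how often the budget check ran */

/* Mock of the real driver allocation entry point. */
static sk_result real_mem_alloc(sk_devptr *dptr, size_t bytes) {
    (void)bytes;
    if (dptr == NULL) return SK_INVALID_VALUE;
    *dptr = 0x1000;
    return SK_OK;
}

/* Mock budget check: anything above 1024 bytes counts as over budget. */
static int over_budget(size_t bytes) {
    g_budget_checks++;
    return bytes > 1024;
}

/* Hooked entry point with the forward-first NULL guard. */
static sk_result hooked_mem_alloc(sk_devptr *dptr, size_t bytes) {
    if (dptr == NULL) {
        /* Let the driver report its own error; skip all bookkeeping. */
        return real_mem_alloc(dptr, bytes);
    }
    if (over_budget(bytes)) return SK_OUT_OF_MEMORY;
    return real_mem_alloc(dptr, bytes);
}
```

Placing the guard before the budget check means a NULL probe can never be misreported as an out-of-memory condition, which is the masking problem the Step B commits describe.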
Step C redesigned: Vulkan layer split into libvgpu_vk.so
The 2026-04-28 attempt (commits
Commits (this push)
(Plus the docs commits already on the branch documenting the trace evidence + reverts.)
What changed
Verification on ws-node074 (isaac-launchable-0)
The first row above is the headline: the 2026-04-28 regression class is structurally gone. Production
Out of scope for this PR
Spec / Plan
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Adds hami_vkGetPhysicalDeviceMemoryProperties[2] hooks that forward to the next layer and then clamp each VK_MEMORY_HEAP_DEVICE_LOCAL_BIT heap size down to the pod budget returned by hami_budget_of(). A budget of 0 is treated as unlimited and skips clamping.
A pointer-hash physdev_index() is used provisionally; Task 1.6 replaces it with an NVML UUID lookup.
Also guards the dispatch resolver against a NULL gipa/gdpa so unit tests can register a dispatch entry and populate function pointers manually.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
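The clamping rule in the commit above (device-local heaps are capped at the pod budget, and a budget of 0 means unlimited) can be sketched like this. The `sk_heap` and `sk_mem_props` structs are simplified stand-ins for the Vulkan memory-property structs so the block builds without Vulkan headers, and this `clamp_heaps` is an illustrative reimplementation, not the function in the PR.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_HEAP_DEVICE_LOCAL_BIT 0x1u  /* same value as VK_MEMORY_HEAP_DEVICE_LOCAL_BIT */
#define SK_MAX_MEMORY_HEAPS 16         /* same value as VK_MAX_MEMORY_HEAPS */

/* Simplified stand-ins for VkMemoryHeap / VkPhysicalDeviceMemoryProperties. */
typedef struct { uint64_t size; uint32_t flags; } sk_heap;
typedef struct {
    uint32_t heap_count;
    sk_heap heaps[SK_MAX_MEMORY_HEAPS];
} sk_mem_props;

/* Caps every device-local heap at budget_bytes. A budget of 0 is treated as
 * unlimited, so the properties are left exactly as the next layer wrote them. */
static void clamp_heaps(sk_mem_props *props, uint64_t budget_bytes) {
    if (props == NULL || budget_bytes == 0) return;
    for (uint32_t i = 0; i < props->heap_count && i < SK_MAX_MEMORY_HEAPS; ++i) {
        sk_heap *h = &props->heaps[i];
        if ((h->flags & SK_HEAP_DEVICE_LOCAL_BIT) && h->size > budget_bytes) {
            h->size = budget_bytes;
        }
    }
}
```

Host-visible heaps without the device-local bit are untouched, so only the memory that counts against the GPU partition is reported smaller.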
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…_limiter Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…r manifest Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Replaces the pointer-hash physdev_index heuristic with a proper VkPhysicalDevice → NVML device-index mapping. Walks registered instance dispatches to fetch VkPhysicalDeviceIDProperties.deviceUUID, then matches it against NVML device UUIDs (nvmlDeviceGetUUID). The result is cached per VkPhysicalDevice.
On unresolved devices (software rasterizer, NVML unavailable) it returns -1 and callers skip budget enforcement.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
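One way to picture the UUID match in the commit above is to format the 16 raw Vulkan bytes into the textual form NVML reports and compare strings. The sketch below assumes NVML UUID strings look like `GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx` in lowercase hex and that the byte order matches `deviceUUID`; both are assumptions to verify on the target driver. `format_nvml_uuid` and `index_for_uuid` are hypothetical names, and the NVML strings are passed in as a plain array rather than queried.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Formats 16 raw bytes as "GPU-" plus the 8-4-4-4-12 hex layout (40 chars + NUL). */
static void format_nvml_uuid(const uint8_t u[16], char out[41]) {
    snprintf(out, 41,
             "GPU-%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
             "%02x%02x%02x%02x%02x%02x",
             u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
             u[8], u[9], u[10], u[11], u[12], u[13], u[14], u[15]);
}

/* Returns the index of the matching NVML UUID string, or -1 when unresolved
 * (callers would then skip budget enforcement, as the commit describes). */
static int index_for_uuid(const uint8_t vk_uuid[16],
                          const char *const *nvml_uuids, int count) {
    char want[41];
    format_nvml_uuid(vk_uuid, want);
    for (int i = 0; i < count; ++i) {
        if (strcmp(want, nvml_uuids[i]) == 0) return i;
    }
    return -1;
}
```

In a real layer the string array would come from calling `nvmlDeviceGetUUID` on each device handle, and the result would be cached per `VkPhysicalDevice` as the commit notes.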
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Vulkan-Headers < 1.3 (Ubuntu 20.04 libvulkan-dev) lacks PFN_vkQueueSubmit2 and VkSubmitInfo2. Guard the struct member, dispatch population, hook wrapper and layer PFN entry so libvgpu.so builds on older header sets.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
HAMi-core's CUDA symbol trampoline is populated via cuInit() -> preInit() -> load_cuda_libraries(). CUDA apps trigger this naturally, but Vulkan-only apps never call cuInit, leaving oom_check without a valid cuDeviceGetCount pointer ("Hijack failed" error). Call cuInit(0) once from budget.c's public entry points to force initialisation.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…stness)
Forwards NULL dptr calls to the real CUDA driver so the caller sees the driver's defined error code (CUDA_ERROR_INVALID_VALUE) instead of HAMi dereferencing the NULL inside allocate_raw. NVIDIA OptiX/Aftermath internal init paths historically pass NULL during fallback probes; without this guard libvgpu.so SegFaults inside Isaac Sim Kit init under LD_PRELOAD. Pattern matches commit 03f99d7 (cuMemGetInfo_v2).
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…rage
Same robustness pattern as Task 2 (cuMemAlloc_v2). cuMemAllocManaged now forwards NULL dptr to the real driver to surface CUDA_ERROR_INVALID_VALUE, instead of running oom_check first, which could mask the real driver error as CUDA_ERROR_OUT_OF_MEMORY when oom_check trips before the driver gets called.
cuMemAllocHost_v2 was verified safe by baseline test (forward-first pattern already returns the driver error for NULL hptr without crashing); no source change there.
Test file extended with NULL-pointer cases for both functions to lock in the contract.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…cleanup
Step B Tasks 5-7 bundle:
- cuMemHostAlloc: tests only (already forward-first, no semantic change needed)
- cuMemHostRegister_v2: drop vestigial cuCtxGetDevice(&dev) (result ignored, dev unused) + add explicit NULL hptr guard for forward-first consistency
- cuCtxGetDevice: tests only (pure passthrough)
Pattern matches cuMemAlloc_v2 (commit 88143ab), cuMemAllocManaged (275ba3d), cuMemAllocPitch_v2 (01a58f1).
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…ch table
Foundation for Step C compat hardening:
* dispatch.{h,c}: add EnumerateDeviceExtensionProperties +
EnumerateDeviceLayerProperties function pointers to the per-instance
dispatch struct; resolve both during hami_instance_register so the
layer's own Enumerate* hooks can forward correctly. Add
hami_instance_first() helper that returns the first registered
instance dispatch under lock — used by NULL-instance Enumerate
forwarding when the loader probes before any instance has been
created.
* layer.c: cache the first next-layer GetInstanceProcAddr /
GetDeviceProcAddr in static globals during CreateInstance /
CreateDevice. Expands comments documenting the Vulkan 1.3 §38.3.1
contract for own-name vs NULL pLayerName Enumerate semantics, and
why an earlier draft returning LAYER_NOT_PRESENT broke
vkCreateDevice.
This commit only restructures the existing Enumerate hooks; it does not
yet change GIPA/GDPA fallback behavior (Task 2).
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…nknown
NVIDIA driver and Carbonite probe through our GIPA/GDPA with handles that may not yet be registered: during vkCreateInstance before our register completes, or with upper-layer-wrapped handles. Returning NULL there crashed the caller (SegFault inside libcarb.graphics-vulkan when assembling the dispatch table).
Now we forward to the first-cached next_gipa/next_gdpa from a previous CreateInstance/CreateDevice. Only when both per-handle lookup AND the cache are absent do we return NULL; that is the legitimate pre-CreateInstance loader bootstrap window where Enumerate* hooks have already been matched at the top of the function.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
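The lookup order this commit describes (per-handle table first, then the first cached next-layer resolver, then NULL) can be sketched with mock types. Everything here is hypothetical scaffolding: `sk_register`, `sk_resolve`, and the `sk_*` function-pointer types stand in for the layer's real dispatch code and for `PFN_vkGetInstanceProcAddr`, and the sketch omits the locking a real layer needs.

```c
#include <stddef.h>
#include <string.h>

typedef void (*sk_void_fn)(void);
typedef sk_void_fn (*sk_gipa_fn)(void *instance, const char *name);

typedef struct { void *instance; sk_gipa_fn next_gipa; } sk_instance_entry;

static sk_instance_entry g_table[4];
static size_t g_table_len = 0;
static sk_gipa_fn g_first_next_gipa = NULL;  /* cached at the first registration */

/* Mock next-layer resolver: knows exactly one entry point. */
static void sk_dummy(void) {}
static sk_void_fn sk_mock_next(void *instance, const char *name) {
    (void)instance;
    return strcmp(name, "vkKnown") == 0 ? sk_dummy : NULL;
}

/* Records the per-instance resolver and caches the first one seen. */
static void sk_register(void *instance, sk_gipa_fn next) {
    if (g_table_len < 4) {
        g_table[g_table_len++] = (sk_instance_entry){ instance, next };
    }
    if (g_first_next_gipa == NULL) g_first_next_gipa = next;
}

/* Per-handle lookup, then cache fallback, then NULL only when both are absent. */
static sk_void_fn sk_resolve(void *instance, const char *name) {
    for (size_t i = 0; i < g_table_len; ++i) {
        if (g_table[i].instance == instance) {
            return g_table[i].next_gipa(instance, name);
        }
    }
    if (g_first_next_gipa != NULL) return g_first_next_gipa(instance, name);
    return NULL;
}
```

The NULL return survives only for the window before any instance exists, which matches the bootstrap case the commit message calls legitimate.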
Step C Task 5 audit. Read-only review concludes: no code change needed.
Lifetime: lookup-then-use across dropped lock relies on the Vulkan 1.3 §3.6 externally-synchronized-parameters contract. VkInstance/VkDevice destroy must be application-serialized; Carbonite and Isaac Sim Kit comply. Same pattern as VK_LAYER_KHRONOS_validation and nvidia layers (extending the lock across unbounded next-chain calls would deadlock).
Chain: in-place advance of chain->u.pLayerInfo->pNext is the canonical Khronos vulkan-loader recommendation (LoaderLayerInterface.md). Loader allocates fresh VkLayerInstanceCreateInfo per CreateInstance call; reuse is structurally impossible. Reference layers all do in-place advance.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
… path
Step C Task 3 trace + comparative testing.
Evidence: pre-Step-C 7dcb5a4 alive under LD_PRELOAD forced. Post-Step-C eea2beb crashes (exit=139) at NVIDIA ICD init in libGLX_nvidia.so.0 vk_icdNegotiateLoaderICDInterfaceVersion -> libEGL_nvidia.so.0 __egl_Main -> __sigaction. HAMI_VK_TRACE=0 lines (crash before our wrappers run).
Hypothesis: Task 1 HAMI_HOOK(EnumerateDeviceExtensionProperties) + HAMI_HOOK(EnumerateDeviceLayerProperties) intercept ICD-side global GIPA lookup under LD_PRELOAD-only path (no manifest activation), and return 0 entries when g_inst_head == NULL. NVIDIA driver expects the GIPA chain to fall through to the ICD instead.
Production .so on ws-node074 restored to pre-Step-C backup (md5 8f889313). isaac-launchable-0 confirmed alive after restore.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…unknown
Run 5 attempted to gate HAMI_HOOK(EnumerateDevice*) on g_inst_head != NULL. Same crash, same backtrace, same HAMI_VK_TRACE=0 lines. The hypothesis that Step C Task 1's HAMI_HOOK additions hijack NVIDIA ICD's GIPA lookup is FALSIFIED: our wrapper is never called, yet the crash still happens. Differential surface narrows to .so-load-time effects (exports, static init) rather than Vulkan wrapper logic.
Further bisect blocked by sandbox: clean rebuild of 7dcb5a4 to compare md5 against 8f889313 (production backup) was denied.
Production .so restored to 8f889313 again. isaac-launchable-0 verified alive post-restore.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…device unknown" This reverts commit eea2beb. Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…a dispatch table" This reverts commit 996cb22. Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Five thin wrappers around the HAMi-core symbols that libvgpu_vk.so
will need after the upcoming Vulkan-layer split: oom_check,
add/rm_gpu_device_memory_usage, get_current_device_memory_limit,
rate_limiter.
All five carry __attribute__((visibility("default"))) so that the
release build (-fvisibility=hidden) keeps the export surface narrow:
libvgpu_vk.so DT_NEEDED-resolves only these names and nothing else from
HAMi-core internals. No call-site changes yet — that follows in the next
commit.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
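The export pattern in the commit above, where internals stay hidden under `-fvisibility=hidden` and a thin wrapper carries the default-visibility attribute, might look like the sketch below. `HAMI_CORE_EXPORT_SKETCH`, `internal_memory_limit`, and `hami_core_sketch_memory_limit` are hypothetical names, and the hard-coded 4000 MiB value only exists to make the example observable.

```c
#include <stddef.h>

/* Marks a symbol as exported even when the library is built with
 * -fvisibility=hidden (GCC/Clang attribute). */
#define HAMI_CORE_EXPORT_SKETCH __attribute__((visibility("default")))

/* Stand-in for an internal HAMi-core function that should not be exported. */
static size_t internal_memory_limit(int dev) {
    return dev == 0 ? ((size_t)4000 << 20) : 0;  /* 4000 MiB for device 0 */
}

/* Thin exported wrapper: the only name a dependent .so would resolve. */
HAMI_CORE_EXPORT_SKETCH size_t hami_core_sketch_memory_limit(int dev) {
    return internal_memory_limit(dev);
}
```

Keeping the wrapper layer this thin means the internal function can be renamed or refactored without changing the symbols the Vulkan library links against.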
Replace the extern declarations of oom_check / add_/rm_gpu_device_memory_usage / get_current_device_memory_limit / rate_limiter in src/vulkan/budget.c and src/vulkan/throttle_adapter.c with calls through the new include/hami_core_export.h interface.
This is a pure call-site rewrite: same runtime behavior, same .so boundary (still linked into one libvgpu.so for now). The point is to remove direct dependence on HAMi-core internal symbol names so the upcoming libvgpu_vk.so split can keep DT_NEEDED narrow.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
libvgpu.so loses vulkan_mod and now contains only HAMi-core (NVML/CUDA hooks + allocator + multiprocess). libvgpu_vk.so is a new shared target that holds all of src/vulkan/* and links libvgpu.so as DT_NEEDED so the hami_core_* wrappers resolve when the Vulkan loader dlopen()s the new .so via the implicit-layer manifest.
After this commit:
* nm -D libvgpu.so MUST NOT show vk*
* nm -D libvgpu_vk.so MUST show vkGetInstanceProcAddr, vkGetDeviceProcAddr, vkNegotiateLoaderLayerInterfaceVersion (and only those as exports thanks to -fvisibility=hidden + HAMI_LAYER_EXPORT).
* readelf -d libvgpu_vk.so MUST list libvgpu.so as NEEDED.
Step C plan: docs/superpowers/plans/2026-04-29-step-c-vk-so-split.md
Spec: docs/superpowers/specs/2026-04-29-step-c-redesign-vk-so-split.md
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Static manifest that the Step D webhook + DaemonSet will install into /etc/vulkan/implicit_layer.d/ to activate libvgpu_vk.so via the Vulkan loader. file_format_version 1.0.0, type INSTANCE, api 1.3.0. library_path is the production install path /usr/local/vgpu/libvgpu_vk.so; no extensions claimed (the layer only intercepts existing entry points).
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…ch table
Foundation for Step C compat hardening:
* dispatch.{h,c}: add EnumerateDeviceExtensionProperties +
EnumerateDeviceLayerProperties function pointers to the per-instance
dispatch struct; resolve both during hami_instance_register so the
layer's own Enumerate* hooks can forward correctly. Add
hami_instance_first() helper that returns the first registered
instance dispatch under lock — used by NULL-instance Enumerate
forwarding when the loader probes before any instance has been
created.
* layer.c: cache the first next-layer GetInstanceProcAddr /
GetDeviceProcAddr in static globals during CreateInstance /
CreateDevice. Expands comments documenting the Vulkan 1.3 §38.3.1
contract for own-name vs NULL pLayerName Enumerate semantics, and
why an earlier draft returning LAYER_NOT_PRESENT broke
vkCreateDevice.
This commit only restructures the existing Enumerate hooks; it does not
yet change GIPA/GDPA fallback behavior (Task 2).
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…nknown
NVIDIA driver and Carbonite probe through our GIPA/GDPA with handles that may not yet be registered: during vkCreateInstance before our register completes, or with upper-layer-wrapped handles. Returning NULL there crashed the caller (SegFault inside libcarb.graphics-vulkan when assembling the dispatch table).
Now we forward to the first-cached next_gipa/next_gdpa from a previous CreateInstance/CreateDevice. Only when both per-handle lookup AND the cache are absent do we return NULL; that is the legitimate pre-CreateInstance loader bootstrap window where Enumerate* hooks have already been matched at the top of the function.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
cpplint flagged src/vulkan/layer.c:331 at 126 chars. Split the HAMI_TRACE format string and arg onto two lines so the longest line is ~107 chars.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Force-pushed from 4bf5912 to c3beead.
Force-pushed from e71fa3c to 292300c.
/retest (Build libvgpu CI failure is a transient apt mirror network issue — |
@100milliongold: Cannot trigger testing until a trusted user reviews the PR and leaves an DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Force-pushed from 292300c to c12ac59.
…te heapUsage on budget query
Two related fixes for VK_EXT_memory_budget on Carbonite/Kit and similar
engines that target a Vulkan 1.0 instance with the KHR-extension form of
properties2:
(1) Hook the KHR alias of vkGetPhysicalDeviceMemoryProperties2
The HAMI_HOOK macro previously matched only the core (Vulkan 1.1+)
name. Loaders queried our layer's GetInstanceProcAddr for the KHR
alias `vkGetPhysicalDeviceMemoryProperties2KHR` and fell through to
the next layer / ICD when we returned NULL. Result: clamp_heaps was
never invoked on those calls, so engines saw the host GPU's full
heap size (~45 GiB) instead of the partition limit. Add a
HAMI_HOOK_KHR_ALIAS macro and wire it for the memory-properties2
entry point so the same hami_vk* function services both names.
(2) Inflate heapUsage rather than clamping heapBudget
Engines that consume VK_EXT_memory_budget render an on-screen
"X used / Y available" overlay computed as heapBudget - heapUsage.
Clamping heapBudget to the partition limit produces the right
"available" value but causes omni.physx.tensors to deadlock during
plugin initialization (Isaac Sim Streaming 6.0, NVIDIA 580.142).
The deadlock persists regardless of whether heapBudget is clamped
to partition_limit, partition_limit-heapUsage, or any value below
heap.size — PhysX/Carbonite consumes the absolute heapBudget
through paths beyond the simple subtraction.
Workaround: leave heapBudget at the ICD-reported value (host GPU's
free memory) and inflate heapUsage by (icd_budget - partition_limit).
The overlay's available calculation now equals
heapBudget - (icd_usage + delta) = partition_limit - icd_usage,
matching the partition. PhysX is unaffected because it does not
consult heapUsage.
Verified end-to-end on isaac-launchable-0 (23 GiB partition,
ws-node074):
Pre-fix: size=44.99 GiB heapBudget=32.12 GiB
overlay "1.3 used / 32.2 available" [false]
Post-fix: size=23.00 GiB heapBudget=33.4 GiB heapUsage=11.5 GiB
overlay "11.5 used / 21.9 available" [21.9 = 23 - 1.1]
omni.physx.tensors initializes cleanly; Kit reaches the
streaming-ready idle state at 60 fps.
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
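The heapUsage workaround in part (2) of the commit above is simple arithmetic, sketched here with the post-fix numbers from the message. `sk_budget`, `inflate_usage`, and `overlay_available` are hypothetical names; the early return when the ICD budget is already at or below the partition limit is an assumption, since the commit does not say what happens in that case.

```c
#include <stdint.h>

/* Simplified stand-in for one heap's entry in the memory-budget properties. */
typedef struct { uint64_t heap_budget; uint64_t heap_usage; } sk_budget;

/* Leaves heap_budget at the ICD value and inflates heap_usage by
 * (icd_budget - partition_limit). Assumption: nothing changes when the
 * limit is 0 (unlimited) or the ICD budget is not above the limit. */
static void inflate_usage(sk_budget *b, uint64_t partition_limit) {
    if (partition_limit == 0 || b->heap_budget <= partition_limit) return;
    b->heap_usage += b->heap_budget - partition_limit;
}

/* What an overlay computing "budget - usage" would display as available. */
static uint64_t overlay_available(const sk_budget *b) {
    return b->heap_budget > b->heap_usage ? b->heap_budget - b->heap_usage : 0;
}
```

With values in tenths of a GiB, an ICD budget of 334, ICD usage of 11, and a partition of 230 give an inflated usage of 115 and 219 available, which lines up with the "11.5 used / 21.9 available" overlay reported in the commit.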
Force-pushed from c12ac59 to 58d304f.
Summary
Adds a Vulkan implicit layer to HAMi-core that hooks vkAllocateMemory/vkFreeMemory and enforces the same per-pod memory budget HAMi-core already enforces for CUDA. This closes the gap where Vulkan workloads (Isaac Sim, ray tracing, GPU-accelerated rendering) bypass the limit because allocations don't go through the CUDA driver.
This is the companion to Project-HAMi/HAMi#1803, which wires up the manifest mount in the device-plugin and the env injection in the admission webhook. That PR cannot be merged upstream until this layer ships in HAMi-core, so I'm opening this first.
Why
HAMi-core currently enforces vGPU memory limits by intercepting CUDA driver calls (cuMemAlloc, cuMemAllocAsync, etc.). Vulkan applications allocate device memory through vkAllocateMemory, which goes directly to the NVIDIA Vulkan ICD and never touches HAMi-core's CUDA hook table. We hit this in production with Isaac Sim: Kit allocated several GB through Vulkan, ignored the requested partition, and OOM'd the host.
The Vulkan layer reuses HAMi-core's existing CUDA budget bookkeeping (CUDA_DEVICE_MEMORY_LIMIT_0 etc.) so a pod that asks for 4000 MiB gets the same enforcement on both APIs.
Design
- enable_environment: HAMI_VULKAN_ENABLE=1, so the layer only activates when the pod opts in. CUDA-only workloads see no behavior change.
- vkCreateInstance to chain the dispatch table, then vkAllocateMemory, vkFreeMemory, vkGetPhysicalDeviceMemoryProperties[2], and vkQueueSubmit[2]. Other entry points pass through unchanged.
- vkQueueSubmit[2] calls through HAMi-core's rate_limiter so the existing core-utilization throttling also applies to Vulkan workloads.
- vkGetPhysicalDeviceMemoryProperties don't try to grab the full physical heap.
What changed
src/vulkan/layer.{c,h} | vkCreateInstance/vkGetInstanceProcAddr/vkGetDeviceProcAddr).
src/vulkan/dispatch.{c,h} |
src/vulkan/hooks_alloc.c | vkAllocateMemory/vkFreeMemory hooks that consult HAMi-core's CUDA budget.
src/vulkan/hooks_memory.c | vkGetPhysicalDeviceMemoryProperties[2] clamping device-local heap to pod budget.
src/vulkan/hooks_submit.c | vkQueueSubmit[2] rate-limiting via throttle_adapter.
src/vulkan/budget.h |
src/vulkan/throttle_adapter.{c,h} | rate_limiter.
src/vulkan/physdev_index.{c,h} | VkPhysicalDevice → device index via NVML UUID.
src/vulkan/hami_implicit_layer.json | enable_environment: HAMI_VULKAN_ENABLE=1.
CMakeLists.txt (+ install rules) | vulkan_mod as an OBJECT library linked into libvgpu.so; install the manifest. Library symbol surface unchanged for CUDA-only consumers.
test/vulkan/*.c |
test/CMakeLists.txt |
src/cuda/... (2 fixes) | cuMemFree[Async] falls back to the real driver when called with an untracked pointer (some Vulkan ICD callbacks free pointers we never tracked). cuMemGetInfo_v2 guards against NULL out params (OptiX crash repro).
Build
vulkan_mod is built as an OBJECT library and linked into libvgpu.so, so:
- .so, so HAMi only needs to mount one library.
- ${CMAKE_INSTALL_PREFIX}/etc/vulkan/implicit_layer.d/hami.json.
Vulkan headers >= 1.3.280 are required for vkQueueSubmit2. A fallback for the VK_LAYER_EXPORT macro covers older headers.
Test plan
- make test: unit tests under test/vulkan/ pass on a host without an NVIDIA GPU (mocked dispatch).
- make docker (with libvulkan-dev installed).
- nvidia.com/gpumem: 4000 + hami.io/vulkan: "true". Kit boot log reports GPU Memory: 4000 MB and the workload is held to it.
Compatibility
- enable_environment and HAMi-core's existing per-pod budget; neither activates without the pod opting in.
- vkAllocateMemory, queue submit throttled like CUDA kernel launches.
Notes for reviewers
- vkQueueSubmit2. If the project prefers an older minimum, the relevant code path is already guarded by VK_VERSION_1_3 and can be made strictly conditional.
- src/cuda/* fixes (cuMemFree[Async] untracked-pointer fallback, cuMemGetInfo_v2 NULL guard) are small and stand on their own; happy to split them into a separate PR if it makes review easier.
- xiilab/HAMi-core@vulkan-layer; once this PR is merged I'll update it to point at the upstream commit.
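The allocation hooks described in the Summary follow a check, forward, then account pattern, which the mock below illustrates. It is a sketch under stated assumptions rather than the code in hooks_alloc.c: `sk_hook_allocate`, `sk_hook_free`, and `sk_next_allocate` are hypothetical names, the counters stand in for HAMi-core's shared budget bookkeeping, and returning the out-of-device-memory code on a budget miss is an assumption about the layer's behavior.

```c
#include <stdint.h>

/* Same numeric values as VK_SUCCESS and VK_ERROR_OUT_OF_DEVICE_MEMORY. */
enum { SK_SUCCESS = 0, SK_ERROR_OUT_OF_DEVICE_MEMORY = -2 };

static uint64_t g_limit = 0;  /* pod budget in MiB; 0 means unlimited */
static uint64_t g_used = 0;   /* MiB currently accounted to this pod */

/* Mock of the next layer / ICD allocation call; always succeeds here. */
static int sk_next_allocate(uint64_t size) {
    (void)size;
    return SK_SUCCESS;
}

/* Rejects the request before it reaches the driver when it would exceed
 * the budget; accounts the size only after the driver call succeeds. */
static int sk_hook_allocate(uint64_t size) {
    if (g_limit != 0 && (size > g_limit || g_used > g_limit - size)) {
        return SK_ERROR_OUT_OF_DEVICE_MEMORY;
    }
    int r = sk_next_allocate(size);
    if (r == SK_SUCCESS) g_used += size;
    return r;
}

/* Releases accounting for a freed allocation, saturating at zero. */
static void sk_hook_free(uint64_t size) {
    g_used = size > g_used ? 0 : g_used - size;
}
```

With a 4000 MiB limit, a 3000 MiB allocation succeeds, a further 1500 MiB request is refused without touching the driver, and freeing the first allocation makes the full budget available again.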