feat(vulkan): Vulkan implicit layer to enforce per-pod GPU memory budget on vkAllocateMemory #182

Open
100milliongold wants to merge 44 commits into Project-HAMi:main from xiilab:vulkan-layer

Conversation

@100milliongold

Summary

Adds a Vulkan implicit layer to HAMi-core that hooks vkAllocateMemory / vkFreeMemory and enforces the same per-pod memory budget HAMi-core already enforces for CUDA. This closes the gap where Vulkan workloads (Isaac Sim, ray tracing, GPU-accelerated rendering) bypass the limit because allocations don't go through the CUDA driver.

This is the companion to Project-HAMi/HAMi#1803, which wires up the manifest mount in the device-plugin and the env injection in the admission webhook. That PR cannot be merged upstream until this layer ships in HAMi-core, so I'm opening this first.

Why

HAMi-core currently enforces vGPU memory limits by intercepting CUDA driver calls (cuMemAlloc, cuMemAllocAsync, etc.). Vulkan applications allocate device memory through vkAllocateMemory, which goes directly to the NVIDIA Vulkan ICD and never touches HAMi-core's CUDA hook table. We hit this in production with Isaac Sim: Kit allocated several GB through Vulkan, ignored the requested partition, and OOM'd the host.

The Vulkan layer reuses HAMi-core's existing CUDA budget bookkeeping (CUDA_DEVICE_MEMORY_LIMIT_0 etc.) so a pod that asks for 4000 MiB gets the same enforcement on both APIs.
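To make the "one budget, two APIs" idea concrete, here is a minimal sketch of a shared per-device counter. The names demo_budget, demo_budget_reserve and demo_budget_release are hypothetical and are not HAMi-core's actual API; the real layer forwards into HAMi-core's existing counters rather than keeping its own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device budget shared by the CUDA and Vulkan paths. */
typedef struct {
    uint64_t limit_bytes; /* 0 means "no limit" in this sketch */
    uint64_t used_bytes;  /* bytes currently accounted, from either API */
} demo_budget;

/* Account for `size` bytes; returns false if the limit would be exceeded. */
static bool demo_budget_reserve(demo_budget *b, uint64_t size)
{
    if (b->limit_bytes != 0) {
        /* Written to avoid overflow in used + size. */
        if (size > b->limit_bytes || b->used_bytes > b->limit_bytes - size)
            return false;
    }
    b->used_bytes += size;
    return true;
}

/* Give `size` bytes back, clamping at zero for untracked frees. */
static void demo_budget_release(demo_budget *b, uint64_t size)
{
    b->used_bytes = (size > b->used_bytes) ? 0 : b->used_bytes - size;
}
```

With a 4000 MiB limit, a 3000 MiB reservation made on behalf of CUDA followed by a 2000 MiB request from Vulkan would be rejected, because both draw from the same counter.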

Design

  • The layer is a standard Vulkan implicit layer — Vulkan's loader picks it up automatically when the manifest is on the system.
  • The manifest is gated by enable_environment: HAMI_VULKAN_ENABLE=1, so the layer only activates when the pod opts in. CUDA-only workloads see no behavior change.
  • The layer hooks vkCreateInstance to chain the dispatch table, then vkAllocateMemory, vkFreeMemory, vkGetPhysicalDeviceMemoryProperties[2], and vkQueueSubmit[2]. Other entry points pass through unchanged.
  • Device index resolution uses NVML UUID lookup so the per-device budget matches what HAMi-core already tracks for CUDA.
  • A small "budget adapter" forwards Vulkan allocation accounting into HAMi-core's existing counter machinery — no parallel bookkeeping, no risk of CUDA and Vulkan budgets drifting apart.
  • A "throttle adapter" forwards vkQueueSubmit[2] calls through HAMi-core's rate_limiter so the existing core-utilization throttling also applies to Vulkan workloads.
  • Memory property queries clamp the device-local heap size to the per-pod budget so apps that pre-size their allocations from vkGetPhysicalDeviceMemoryProperties don't try to grab the full physical heap.
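The heap clamp in the last bullet can be sketched without the Vulkan headers. demo_heap below is a stand-in for VkMemoryHeap and demo_clamp_heaps is a hypothetical name; the treatment of a zero budget as "unlimited" follows the commit message for the memprops hook, everything else is an assumption for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Same numeric value as VK_MEMORY_HEAP_DEVICE_LOCAL_BIT. */
#define DEMO_HEAP_DEVICE_LOCAL_BIT 0x1u

/* Stand-in for VkMemoryHeap (size + flags). */
typedef struct {
    uint64_t size;
    uint32_t flags;
} demo_heap;

/* Clamp every device-local heap to `budget` bytes; 0 means unlimited. */
static void demo_clamp_heaps(demo_heap *heaps, size_t count, uint64_t budget)
{
    if (budget == 0)
        return; /* no per-pod limit configured: report the native sizes */

    for (size_t i = 0; i < count; i++) {
        int device_local = (heaps[i].flags & DEMO_HEAP_DEVICE_LOCAL_BIT) != 0;
        if (device_local && heaps[i].size > budget)
            heaps[i].size = budget;
    }
}
```

Host-visible heaps are left untouched in this sketch, since the budget only describes device memory.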

What changed

| Area | Change |
| --- | --- |
| src/vulkan/layer.{c,h} | Layer entry point + dispatch table chaining (vkCreateInstance / vkGetInstanceProcAddr / vkGetDeviceProcAddr). |
| src/vulkan/dispatch.{c,h} | Per-instance dispatch table loader. |
| src/vulkan/hooks_alloc.c | vkAllocateMemory / vkFreeMemory hooks that consult HAMi-core's CUDA budget. |
| src/vulkan/hooks_memory.c | vkGetPhysicalDeviceMemoryProperties[2] clamping device-local heap to pod budget. |
| src/vulkan/hooks_submit.c | vkQueueSubmit[2] rate-limiting via throttle_adapter. |
| src/vulkan/budget.h | Adapter type forwarding allocation events into HAMi-core counters. |
| src/vulkan/throttle_adapter.{c,h} | Adapter forwarding queue submit throttling to rate_limiter. |
| src/vulkan/physdev_index.{c,h} | Resolve VkPhysicalDevice → device index via NVML UUID. |
| src/vulkan/hami_implicit_layer.json | Vulkan implicit-layer manifest, gated by enable_environment: HAMI_VULKAN_ENABLE=1. |
| CMakeLists.txt (+ install rules) | Build vulkan_mod as an OBJECT library linked into libvgpu.so; install the manifest. Library symbol surface unchanged for CUDA-only consumers. |
| test/vulkan/*.c | Unit tests: layer init, allocation budget, memprops clamp, queue submit throttle, throttle adapter forwarding. |
| test/CMakeLists.txt | Vulkan tests built as a separate target so existing CUDA test glob is unchanged. |
| src/cuda/... (2 fixes) | cuMemFree[Async] falls back to the real driver when called with an untracked pointer (some Vulkan ICD callbacks free pointers we never tracked). cuMemGetInfo_v2 guards against NULL out params (OptiX crash repro). |

Build

vulkan_mod is built as an OBJECT library and linked into libvgpu.so, so:

  • The shared library symbol surface for existing CUDA-only consumers is unchanged.
  • The Vulkan layer ships inside the same .so, so HAMi only needs to mount one library.
  • The manifest installs to ${CMAKE_INSTALL_PREFIX}/etc/vulkan/implicit_layer.d/hami.json.

Vulkan headers >= 1.3.280 are required for vkQueueSubmit2. A fallback for the VK_LAYER_EXPORT macro covers older headers.

Test plan

  • make test — unit tests under test/vulkan/ pass on a host without an NVIDIA GPU (mocked dispatch).
  • Built as part of HAMi's make docker (with libvulkan-dev installed).
  • E2E: deployed inside a HAMi cluster (with the companion HAMi PR), ran an Isaac Sim pod with nvidia.com/gpumem: 4000 + hami.io/vulkan: "true". Kit boot log reports GPU Memory: 4000 MB and the workload is held to it.
  • Regression: same image without the annotation runs CUDA workloads bit-identical to the previous release.

Compatibility

  • CUDA-only workloads: zero behavior change. The Vulkan layer is gated by both the manifest's enable_environment and HAMi-core's existing per-pod budget — neither activates without the pod opting in.
  • Vulkan workloads without the annotation: no Vulkan layer loaded, no env, no behavior change.
  • Vulkan workloads with the annotation: per-pod memory budget enforced on vkAllocateMemory, queue submit throttled like CUDA kernel launches.

Notes for reviewers

  • This branch uses Vulkan-Headers 1.3.280+ to access vkQueueSubmit2. If the project prefers an older minimum, the relevant code path is already guarded by VK_VERSION_1_3 and can be made strictly conditional.
  • The two src/cuda/* fixes (cuMemFree[Async] untracked-pointer fallback, cuMemGetInfo_v2 NULL guard) are small and stand on their own — happy to split them into a separate PR if it makes review easier.
  • I can split this PR into smaller pieces (skeleton + dispatch / alloc hooks / submit throttle / memprops clamp / build integration / tests) if that's easier to review.
  • The companion HAMi PR is Project-HAMi/HAMi#1803. It currently points the libvgpu submodule at xiilab/HAMi-core@vulkan-layer; once this PR is merged I'll update it to point at the upstream commit.

@hami-robot
Contributor

hami-robot Bot commented Apr 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: 100milliongold
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot
Contributor

hami-robot Bot commented Apr 27, 2026

Welcome @100milliongold! It looks like this is your first PR to Project-HAMi/HAMi-core 🎉

@100milliongold
Author

Honest status update — Vulkan hook chain not fully wired in production

I want to disclose a partial regression I found while doing end-to-end validation on a real cluster (RTX 6000 Ada × 2, NVIDIA driver 580.142, K8s 1.34, isaac-launchable production workload). I would love guidance on the right fix.

What works

  • webhook env injection (HAMI_VULKAN_ENABLE=1, graphics capability merge) — verified via container env.
  • volcano-vgpu-device-plugin Vulkan manifest auto-mount at /etc/vulkan/implicit_layer.d/hami.json — verified.
  • The implicit-layer manifest is parsed by the Vulkan loader; Loading layer library /usr/local/vgpu/libvgpu.so, Insert instance layer "VK_LAYER_HAMI_vgpu", vkCreateInstance layer callstack setup to: VK_LAYER_HAMI_vgpu, and the corresponding device-side line all show up under VK_LOADER_DEBUG=all.
  • After commit 30ddb01 the loader entry points (vkNegotiateLoaderLayerInterfaceVersion, vkGetInstanceProcAddr, vkGetDeviceProcAddr) are exported with default visibility (an earlier build had -fvisibility=hidden and only hami_*-prefixed symbols, so the loader could not resolve the layer at all — that part is now fixed).
  • CUDA-path enforcement is intact: PyTorch ≥ 25 GiB allocations on a 23 GiB-budgeted pod return CUDA out of memory from HAMi-core's cuMemAlloc hook.

What does NOT work (yet)

A small Vulkan-only test that talks to libvulkan.so.1 directly (ctypes, minimal vkCreateInstance → vkGetPhysicalDeviceMemoryProperties → vkCreateDevice → vkAllocateMemory(25 GiB)) shows that the layer's hook functions are not actually called:

heap[0] size = 44.99 GiB device_local=True   ← native, not 23 GiB; vkGetPhysicalDeviceMemoryProperties hook bypassed
=== vkAllocateMemory tests ===
  20 GiB: SUCCESS
  22 GiB: SUCCESS
  25 GiB: SUCCESS                            ← over the 23 GiB budget
  30 GiB: SUCCESS                            ← also over

So the loader sees the layer and inserts it into the call stack, but at dispatch time the calls go straight to the ICD — our hami_vkGetPhysicalDeviceMemoryProperties and hami_vkAllocateMemory are never invoked. The same is presumably true for hami_vkQueueSubmit (which is why I cannot yet show queue-throttle evidence either).

Where I am stuck

The most likely failure modes I am tracking:

  1. hami_vkCreateInstance not actually entered. If our hami_vkCreateInstance is bypassed by the loader, hami_instance_register() never runs, so the instance dispatch table our hooks rely on is empty, and hami_vkGetInstanceProcAddr falls back through if (!d) return NULL; for any name that isn't in the explicit HAMI_HOOK(...) list. I have not yet been able to confirm whether hami_vkCreateInstance is being called.
  2. Wrong layer interface version with manifest 1.2.0. Our manifest is file_format_version: "1.2.0" and we export vkNegotiateLoaderLayerInterfaceVersion, so the loader should pick interface v2. The loader debug output says using deprecated 'vkGetInstanceProcAddr' tag when I add a functions block to the manifest, and shows neither Enable Env Var nor any negotiate trace when I remove that block — I cannot tell from the log alone which path actually wins.
  3. Layer hook is built but not chain-linked. nm -D libvgpu.so shows all hami_* and the canonical vk* entry points exported. Yet we still see no hook activity at runtime. This makes me suspect a chain-info handling bug in src/vulkan/layer.c::hami_vkCreateInstance (e.g. wrong VK_LAYER_LINK_INFO walk, or chain->u.pLayerInfo = chain->u.pLayerInfo->pNext advancing into NULL).
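For reference while reviewing failure mode 3, this is a sketch of the chain walk a layer's vkCreateInstance is expected to perform, written against stand-in structs so it compiles without vk_layer.h. demo_chain_node mirrors the role of VkLayerInstanceCreateInfo and demo_layer_link the role of VkLayerInstanceLink; all demo_* names and enum values are hypothetical, and this is not the code in src/vulkan/layer.c.

```c
#include <stddef.h>

typedef void (*demo_proc)(void);
typedef demo_proc (*demo_gipa_fn)(void *instance, const char *name);

enum { DEMO_STYPE_OTHER = 1, DEMO_STYPE_LOADER_CREATE_INFO = 2 };
enum { DEMO_LAYER_LINK_INFO = 0, DEMO_LAYER_DATA_CALLBACK = 1 };

/* One element of the layer chain (next layer or, last, the ICD). */
typedef struct demo_layer_link {
    struct demo_layer_link *next;
    demo_gipa_fn next_gipa; /* that element's GetInstanceProcAddr */
} demo_layer_link;

/* One node of the create-info pNext chain. */
typedef struct demo_chain_node {
    int stype;
    struct demo_chain_node *next; /* pNext */
    int function;                 /* only meaningful for loader nodes */
    demo_layer_link *layer_info;  /* u.pLayerInfo for link-info nodes */
} demo_chain_node;

/* Stand-in for the entry point at the bottom of the chain. */
static demo_proc demo_terminator_gipa(void *instance, const char *name)
{
    (void)instance;
    (void)name;
    return NULL;
}

/* Find the link-info node, take the next GetInstanceProcAddr, then advance
 * the chain in place so the layer below sees its own link. */
static demo_gipa_fn demo_take_next_gipa(demo_chain_node *head)
{
    demo_chain_node *n = head;
    while (n != NULL &&
           !(n->stype == DEMO_STYPE_LOADER_CREATE_INFO &&
             n->function == DEMO_LAYER_LINK_INFO))
        n = n->next;

    if (n == NULL || n->layer_info == NULL)
        return NULL; /* no link info: the caller must fail initialization */

    demo_gipa_fn gipa = n->layer_info->next_gipa; /* read BEFORE advancing */
    n->layer_info = n->layer_info->next;          /* may legitimately be NULL */
    return gipa;
}
```

The ordering is the point of the sketch: the next entry point is captured before the in-place advance, so the link pointer becoming NULL for the last layer is harmless. Reading the function pointer after advancing would dereference NULL for the bottom-most layer.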

I plan to add fprintf(stderr, "HAMI: hami_vkCreateInstance entered\n") style instrumentation, rebuild, and rerun the ctypes test to localize which of (1)–(3) is actually true. But before I send the next round, I would really value your read on the chain handling in src/vulkan/layer.c — if you spot something obviously wrong from a quick skim it would save a debug cycle.

Disclosure

Per the AI Assistance Notice — Claude (Anthropic) was used as an editing/exploration assistant for this work, but every decision, build, and validation step was run and verified by me on the actual cluster. The failure mode above is a real one I observed in production; I am bringing it to you because the PR is not yet shippable and I would rather acknowledge that openly than push a half-working layer.

Companion PR status

Thanks for taking another look. /cc @archlitchi @wawa0210 @FouoF

@100milliongold
Author

Update — partition enforcement now verified end-to-end

Following up on my earlier disclosure, I traced the root cause and pushed a fix in 93dd103.

Root cause

The Vulkan loader chain, layer entry points, and vkAllocateMemory hook were all wired correctly — that part of this PR works. The break was in the physical-device → NVML index resolver:

HAMI_VK_TRACE: vk_uuid (target) = 00000000-0000-0000-0000-000000000000
HAMI_VK_TRACE: nvml[0] uuid_str='GPU-07ac64f7-ff7f-d9fb-9bcb-7afb7720386f'
HAMI_VK_TRACE: nvml_index_for_uuid: NO MATCH -> -1
HAMI_VK_TRACE: clamp_heaps physdev_index -> dev=-1
HAMI_VK_TRACE: clamp_heaps EARLY RETURN (dev<0)
HAMI_VK_TRACE: hami_vkAllocateMemory: idx<0 -> SKIP budget enforcement

VkPhysicalDeviceIDProperties.deviceUUID returned by NVIDIA driver 580.142 inside the container was 16 bytes of zero, even though vkGetPhysicalDeviceProperties2 returns successfully and VK_KHR_external_memory_capabilities is exposed. NVML's nvmlDeviceGetUUID returned the correct GPU-… value, so strict byte-compare matching failed and the resolver returned -1. With dev=-1, clamp_heaps early-returns and the alloc hook skips hami_budget_reserve — explaining the 44.99 GiB heap and unbounded allocations I saw before.

I haven't pinned down whether this is a driver bug, a missing extension on the loader path, or something specific to the Vulkan loader → ICD shim — that's worth a separate report — but the practical impact is clear.

Fix

Single-GPU heuristic in src/vulkan/physdev_index.c::nvml_index_for_uuid:

if (is_zero_uuid(vk_uuid) && count == 1) {
    HAMI_TRACE("nvml_index_for_uuid: vk_uuid all-zero + NVML count==1 "
               "-> single-GPU fallback idx=0");
    return 0;
}

Safe because the HAMi operating model assigns one GPU per container via the device-plugin. Multi-GPU containers (NVIDIA_VISIBLE_DEVICES enumerating 2+ GPUs) fall through to the strict matcher to avoid mis-binding.
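A self-contained sketch of the resolver with the fallback in place is shown below. It assumes the 16 raw bytes of the Vulkan deviceUUID map, in order, onto the lowercase hex digits of the NVML "GPU-..." string; the demo_* names are hypothetical, and the real nvml_index_for_uuid may compare bytes rather than strings.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_UUID_BYTES 16
#define DEMO_UUID_STRLEN 41 /* "GPU-" + 32 hex digits + 4 dashes + NUL */

static int demo_is_zero_uuid(const uint8_t u[DEMO_UUID_BYTES])
{
    for (int i = 0; i < DEMO_UUID_BYTES; i++)
        if (u[i] != 0)
            return 0;
    return 1;
}

/* Format raw bytes as "GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx". */
static void demo_format_uuid(const uint8_t u[DEMO_UUID_BYTES],
                             char out[DEMO_UUID_STRLEN])
{
    static const char hex[] = "0123456789abcdef";
    char *p = out;
    memcpy(p, "GPU-", 4);
    p += 4;
    for (int i = 0; i < DEMO_UUID_BYTES; i++) {
        if (i == 4 || i == 6 || i == 8 || i == 10)
            *p++ = '-';
        *p++ = hex[u[i] >> 4];
        *p++ = hex[u[i] & 0x0f];
    }
    *p = '\0';
}

/* Returns the index of the matching NVML device, or -1 if unresolved. */
static int demo_index_for_uuid(const uint8_t vk_uuid[DEMO_UUID_BYTES],
                               const char *const *nvml_uuids, int count)
{
    /* Single-GPU fallback: an all-zero UUID can only mean device 0. */
    if (demo_is_zero_uuid(vk_uuid) && count == 1)
        return 0;

    char want[DEMO_UUID_STRLEN];
    demo_format_uuid(vk_uuid, want);
    for (int i = 0; i < count; i++)
        if (strcmp(want, nvml_uuids[i]) == 0)
            return i;
    return -1; /* callers skip budget enforcement */
}
```

With two or more visible GPUs an all-zero UUID still resolves to -1 in this sketch, which matches the "fall through to the strict matcher" behavior described above.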

Verification

ws-node074 / NVIDIA RTX 6000 Ada / driver 580.142 / k0s v1.34.3 / Volcano-scheduled pod with volcano.sh/vgpu-memory=23552Mi:

HAMI_VK_TRACE: nvml_index_for_uuid: vk_uuid all-zero + NVML count==1 -> single-GPU fallback idx=0
HAMI_VK_TRACE: physdev_index physDev=0x... -> idx=0
HAMI_VK_TRACE: budget_of dev=0 -> limit=24696061952
HAMI_VK_TRACE: clamp_heaps[0] 48305799168 -> 24696061952
HAMI_VK_TRACE: budget_reserve oom_check dev=0 size=26843545600 -> 1
HAMI_VK_TRACE: hami_vkAllocateMemory: budget reserve REJECTED idx=0 size=26843545600

heap[0] size=23.00 GiB device_local=True
  20 GiB: SUCCESS
  22 GiB: SUCCESS
  25 GiB: VK_ERROR_OUT_OF_DEVICE_MEMORY  <-- partition enforce
  30 GiB: VK_ERROR_OUT_OF_DEVICE_MEMORY  <-- partition enforce

The partition is now enforced end-to-end through the Vulkan path — the heap reported by vkGetPhysicalDeviceMemoryProperties is clamped from 44.99 GiB to 23.00 GiB and vkAllocateMemory correctly returns VK_ERROR_OUT_OF_DEVICE_MEMORY once the budget is exceeded.

Also bundled in this commit

Opt-in HAMI_VK_TRACE=1 instrumentation across budget.c, physdev_index.c, hooks_memory.c::clamp_heaps, and hooks_alloc.c::hami_vkAllocateMemory. Zero-cost when the env var is unset (cached static int check). Useful for future diagnosis of similar driver/UUID issues — happy to factor it out into a separate commit if reviewers prefer.
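The "cached static int check" can be sketched as follows. The function names are hypothetical, and treating only the exact value "1" as enabled is an assumption; the sketch also ignores the benign race two threads could hit on first use.

```c
#include <stdlib.h>
#include <string.h>

/* Pure helper: only the exact string "1" enables tracing (assumption). */
static int demo_parse_flag(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Reads HAMI_VK_TRACE once; later calls cost one integer compare. */
static int demo_trace_enabled(void)
{
    static int cached = -1; /* -1: environment not read yet */
    if (cached < 0)
        cached = demo_parse_flag(getenv("HAMI_VK_TRACE"));
    return cached;
}
```

A trace macro would then wrap its fprintf in if (demo_trace_enabled()), so the unset case never reaches the formatting code.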

@100milliongold
Author

Pushed 91ca00c adding the missing build switch for the NVML dlsym redirect.

Background — symptom

On a Vulkan-vGPU pod (RTX 6000 Ada, partition 23 GiB) running Isaac Sim 6.0.0-rc.22 streaming kit, runheadless.sh was reproducibly SegFaulting around 8.7s into init, right after omni.kit.livestream.app / isaacsim.exp.full.streaming startup. exit code 139, no useful backtrace (minidump 0 bytes). Reproducible across pod restart, vscode container restart, and a second pod on the same node — variables constant, result constant.

Root cause

The hook for nvmlDeviceGetMemoryInfo / _v2 already exists in src/nvml/hook.c and the dispatcher in src/libvgpu.c:126 sends nvml-prefixed dlsym lookups into __dlsym_hook_section_nvml. But that dispatcher block is wrapped in #ifdef HOOK_NVML_ENABLE, and although build.sh passes -DHOOK_NVML_ENABLE=1, that's a CMake variable — not a compiler -D. Nothing in the CMake files translates it into target_compile_definitions, so the ifdef compiles out and dlsym(handle, "nvmlDeviceGetMemoryInfo_v2") falls through to the real libnvidia-ml.so symbol.

End result: Vulkan and CUDA paths report the partitioned 23552 MiB heap (correct, this PR's existing behavior), but NVML reports the raw 46068 MiB. Isaac Sim's streaming kit consults NVML during init to plan framebuffer / encoder allocations and SegFaults when those plans collide with the actual partition.
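The effect of the missing compile definition can be reproduced with a toy dispatcher. demo_resolve is a hypothetical stand-in for the dlsym dispatcher in src/libvgpu.c and returns labels instead of function pointers; only the HOOK_NVML_ENABLE macro name comes from the source.

```c
#include <string.h>

/* Reports whether the NVML redirect was compiled into this translation unit. */
static int demo_nvml_hook_compiled(void)
{
#ifdef HOOK_NVML_ENABLE
    return 1;
#else
    return 0;
#endif
}

/* Toy dispatcher: "hook" means the lookup was redirected, "real" means it
 * fell through to the vendor library. */
static const char *demo_resolve(const char *symbol)
{
    int is_nvml = strncmp(symbol, "nvml", 4) == 0;
#ifdef HOOK_NVML_ENABLE
    if (is_nvml)
        return "hook";
#endif
    (void)is_nvml; /* unused when the redirect is compiled out */
    return "real";
}
```

Compiled without -DHOOK_NVML_ENABLE on the compiler command line, demo_resolve("nvmlDeviceGetMemoryInfo_v2") yields "real" even though the hook code exists in the source, which is the situation a CMake-only variable produces.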

Verification — readelf on the freshly built .so

Before this commit: readelf --dyn-syms libvgpu.so | grep nvml was empty for the memory-info hooks (they weren't being wired into the dispatcher). After: the dispatcher routes the lookup into _nvmlDeviceGetMemoryInfo, and the LOG_DEBUG trace on entry to nvmlDeviceGetMemoryInfo fires under LIBCUDA_LOG_LEVEL=4.

Verification — production pod on ws-node074

Built the new .so on the node, atomic-swapped /usr/local/vgpu/libvgpu.so, recreated the pod so device-plugin re-mounts the new file, then:

# inside the vscode container, container md5 matches the new build
$ md5sum /usr/local/vgpu/libvgpu.so
6291473077c45bf912f296ef1a4367b9  /usr/local/vgpu/libvgpu.so

# nvidia-smi now reports the partition value
$ nvidia-smi --query-gpu=memory.total,memory.free --format=csv
memory.total [MiB], memory.free [MiB]
23552 MiB, 23552 MiB         # was: 46068 MiB, 45458 MiB

# runheadless.sh no longer dies during streaming init
$ ACCEPT_EULA=y timeout 35 /isaac-sim/runheadless.sh
exit=124   # was: exit=139 SegFault @ 8.7s
# last line: [5.010s] [ext: omni.physx.tensors-110.0.7] startup
# crash count in log: 0   # was: 2

Repeated on a second pod (isaac-launchable-1), same result: 23552 MiB and clean init.

The fix is a 6-line CMake change. Diff for reference:

+# Activate NVML dlsym redirect (libvgpu.c:#ifdef HOOK_NVML_ENABLE).
+# Without this define the dispatcher in dlsym() falls through to the real
+# libnvidia-ml so consumers like nvidia-smi / Isaac Sim Kit see the raw
+# 46 GiB heap instead of the partitioned limit, which is inconsistent with
+# the Vulkan/CUDA paths and trips Kit asserts during streaming init.
+target_compile_definitions(${LIBVGPU} PUBLIC HOOK_NVML_ENABLE)

Existing code (_nvmlDeviceGetMemoryInfo, the v1/v2 wrappers, the dlsym table entries DLSYM_HOOK_FUNC(nvmlDeviceGetMemoryInfo) / ..._v2) is untouched: it was already correct, just unreachable. The same likely applies to the other -D…_ENABLE=1 flags in build.sh (MULTIPROCESS_LIMIT_ENABLE, HOOK_MEMINFO_ENABLE, DLSYM_HOOK_ENABLE). That is out of scope for this PR but worth a follow-up audit.

@100milliongold
Author

Step B complete — CUDA hook NULL guard hardening

Adds NULL pointer guards to CUDA hooks following the pattern from cuMemGetInfo_v2 (commit 03f99d7):

| Hook | Commit | Change |
| --- | --- | --- |
| cuMemAlloc_v2 | 88143ab | NULL dptr forwards to driver |
| cuMemAllocManaged | 275ba3d | NULL dptr forwards before oom_check |
| cuMemAllocPitch_v2 | 01a58f1 | NULL dptr/pPitch forwards before oom_check |
| cuMemHostRegister_v2 | 7dcb5a4 | drop vestigial cuCtxGetDevice + NULL hptr guard |
| cuMemHostAlloc | (tests-only 7dcb5a4) | already forward-first; tests added |
| cuCtxGetDevice | (tests-only 7dcb5a4) | already passthrough; tests added |
| audit notes | 7b76d9b | docs/notes |

Verification

test/test_cuda_null_guards.c — 9 unit tests, all pass under LD_PRELOAD=libvgpu.so on ws-node074. Manual NULL stress test inside isaac-launchable pod (4 hooks × NULL args) all return non-zero error codes, no SegFault. isaac-launchable namespace baseline (5/5 runheadless.sh alive) preserved.

Why

NVIDIA OptiX denoising / Aftermath / Carbonite tasking call HAMi-core hooks during Isaac Sim Kit init with NULL args during fallback probes. Without the guards, libvgpu.so dereferences NULL and SegFaults. Pattern mirrors the existing fix in commit 03f99d7 (cuMemGetInfo_v2). Step C (Vulkan layer compat hardening) follows in a separate plan.
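The forward-first guard can be illustrated with a mock driver. Everything prefixed demo_ is hypothetical; the numeric error values match CUDA_SUCCESS (0), CUDA_ERROR_INVALID_VALUE (1) and CUDA_ERROR_OUT_OF_MEMORY (2), but the mock driver's behavior and the budget check are simplifications, not HAMi-core's allocate_raw or oom_check.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    DEMO_SUCCESS = 0,
    DEMO_ERROR_INVALID_VALUE = 1,
    DEMO_ERROR_OUT_OF_MEMORY = 2
};

static uint64_t demo_limit = 1024; /* per-pod budget in bytes (toy value) */
static uint64_t demo_used = 0;

/* Mock of the real driver entry point: rejects NULL out-pointers itself. */
static int demo_driver_alloc(uint64_t *dptr, uint64_t size)
{
    if (dptr == NULL || size == 0)
        return DEMO_ERROR_INVALID_VALUE;
    *dptr = 0x1000; /* pretend device address */
    return DEMO_SUCCESS;
}

/* Hook: invalid arguments go straight to the driver so the caller sees the
 * driver's own error code; only valid requests reach the budget check. */
static int demo_hook_alloc(uint64_t *dptr, uint64_t size)
{
    if (dptr == NULL)
        return demo_driver_alloc(dptr, size); /* forward first, never deref */

    if (size > demo_limit - demo_used)
        return DEMO_ERROR_OUT_OF_MEMORY;

    int rc = demo_driver_alloc(dptr, size);
    if (rc == DEMO_SUCCESS)
        demo_used += size;
    return rc;
}
```

A NULL probe for 4096 bytes returns the invalid-value code here even though the size exceeds the toy budget, which is the masking problem the cuMemAllocManaged and cuMemAllocPitch_v2 commits describe.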

@100milliongold
Author

Step C redesigned — Vulkan layer split into libvgpu_vk.so

The 2026-04-28 attempt (commits 996cb22 cache+Enumerate hooks, eea2beb GIPA fallback — both reverted in this push) regressed runheadless.sh under LD_PRELOAD on ws-node074. Trace evidence in docs/superpowers/notes/2026-04-28-vk-trace-isaac-sim.md proved our layer wrappers were never called; the regression lived at the .so-load boundary. Rather than spending more diagnostic cycles on production hardware, this redesign makes that class of regression structurally impossible.

Commits (this push)

| Commit | Change |
| --- | --- |
| f52aada | Revert: fix(vulkan): GIPA/GDPA fallback to cached next when instance/device unknown |
| 83fd245 | Revert: fix(vulkan): cache first next-gipa/gdpa + EnumerateDevice* via dispatch table |
| 1118553 | feat(hami-core): explicit hami_core_* export wrappers |
| e5812e6 | refactor(vulkan): use hami_core_* wrappers instead of internal externs |
| b24f71c | build: split Vulkan layer into separate libvgpu_vk.so |
| 65930f4 | feat(vulkan): ship hami.json implicit-layer manifest |

(Plus the docs commits already on the branch documenting the trace evidence + reverts.)

What changed

  • libvgpu.so keeps NVML/CUDA hooks + allocator + multiprocess. Loses all vk* exports.
  • New libvgpu_vk.so carries the entire src/vulkan/* and exports the Vulkan layer entry points (vkGetInstanceProcAddr, vkGetDeviceProcAddr, vkNegotiateLoaderLayerInterfaceVersion). DT_NEEDED includes libvgpu.so so the linker resolves the 5 hami_core_* wrappers at Vulkan-loader dlopen() time.
  • share/hami/hami.json is the implicit-layer manifest the Step D webhook drops into /etc/vulkan/implicit_layer.d/.
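The export-wrapper idea behind the split looks roughly like this. demo_core_device_limit and the internal helper are hypothetical names with an assumed signature; the visibility attribute is GCC/Clang-specific, and in the real library the internal symbols are hidden by -fvisibility=hidden rather than by static.

```c
#include <stdint.h>

#if defined(__GNUC__)
#define DEMO_EXPORT __attribute__((visibility("default")))
#else
#define DEMO_EXPORT
#endif

/* Internal state and helper: not part of the shared library's export set. */
static const uint64_t demo_limits[2] = { 4000ull << 20, 0 };

static uint64_t demo_internal_device_limit(int dev)
{
    return (dev >= 0 && dev < 2) ? demo_limits[dev] : 0;
}

/* Thin exported wrapper: the only name a second .so would need to resolve. */
DEMO_EXPORT uint64_t demo_core_device_limit(int dev)
{
    return demo_internal_device_limit(dev);
}
```

Keeping the wrappers thin means the Vulkan-side library depends on a handful of stable names instead of on internal symbol names that may change.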

Verification on ws-node074 (isaac-launchable-0)

| Check | Result |
| --- | --- |
| nm -D libvgpu.so \| grep ' T vk' | 0 lines ✓ |
| nm -D libvgpu.so \| grep ' T hami_core_' | 5 lines ✓ |
| readelf -d libvgpu_vk.so \| grep NEEDED | includes libvgpu.so |
| nm -D --undefined-only libvgpu_vk.so \| grep hami_core_ | 5 lines ✓ |
| Step B regression test_cuda_null_guards | 9/9 [OK] EXIT=0 ✓ |
| test_alloc under LD_PRELOAD | EXIT=0 ✓ |
| LD_PRELOAD libvgpu.so w/o manifest, runheadless.sh × 5 | 5/5 exit=124 crash=0 listen=1 |
| LD_PRELOAD libvgpu.so + manifest in pod, runheadless.sh × 5 | 5/5 alive, manifest enumerated, NVML clamp 23552 MiB ✓ |
| HAMI_VK_TRACE lines on manifest path | 0; Kit's embedded Conan vulkan-loader path didn't traverse our GIPA in this run; partition Vulkan-side enforcement to be confirmed in Step D's 4-path verification |

The first row above is the headline: the 2026-04-28 regression class is structurally gone. Production /usr/local/vgpu/libvgpu.so was restored to the pre-Step-C backup (md5 8f889313) after verification.

Out of scope for this PR

  • The original Step C tasks (cache first next-gipa, GIPA/GDPA fallback for unknown handles, EnumerateDevice* hooks for Carbonite) were reverted and stay deferred. They will return as a follow-up after the new architecture is verified end-to-end in production via Step D.
  • Visibility hardening for the vulkan_mod symbols (currently the project's build.sh forces CMAKE_BUILD_TYPE=Debug so -fvisibility=hidden doesn't apply): cleanup candidate, not blocking.

Spec / Plan

  • Spec: docs/superpowers/specs/2026-04-29-step-c-redesign-vk-so-split.md
  • Plan: docs/superpowers/plans/2026-04-29-step-c-vk-so-split.md
  • Trace evidence: docs/superpowers/notes/2026-04-28-vk-trace-isaac-sim.md

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Adds hami_vkGetPhysicalDeviceMemoryProperties[2] hooks that forward to the
next layer and then clamp each VK_MEMORY_HEAP_DEVICE_LOCAL heap size down
to the pod budget returned by hami_budget_of(). A budget of 0 is treated
as unlimited and skips clamping. A pointer-hash physdev_index() is used
provisionally; Task 1.6 replaces it with an NVML UUID lookup.

Also guards the dispatch resolver against a NULL gipa/gdpa so unit tests
can register a dispatch entry and populate function pointers manually.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…_limiter

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…r manifest

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Replaces the pointer-hash physdev_index heuristic with a proper
VkPhysicalDevice → NVML device-index mapping. Walks registered instance
dispatches to fetch VkPhysicalDeviceIDProperties.deviceUUID, then matches
it against NVML device UUIDs (nvmlDeviceGetUUID). Result cached per
VkPhysicalDevice. On unresolved devices (software rasterizer, NVML
unavailable) returns -1 and callers skip budget enforcement.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Vulkan-Headers < 1.3 (Ubuntu 20.04 libvulkan-dev) lacks PFN_vkQueueSubmit2
and VkSubmitInfo2. Guard the struct member, dispatch population, hook
wrapper and layer PFN entry so libvgpu.so builds on older header sets.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
HAMi-core's CUDA symbol trampoline is populated via cuInit() ->
preInit() -> load_cuda_libraries(). CUDA apps trigger this naturally,
but Vulkan-only apps never call cuInit, leaving oom_check without a
valid cuDeviceGetCount pointer (Hijack failed error). Call cuInit(0)
once from budget.c's public entry points to force initialisation.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…stness)

Forwards NULL dptr calls to the real CUDA driver so the caller sees the driver's defined error code (CUDA_ERROR_INVALID_VALUE) instead of HAMi dereferencing the NULL inside allocate_raw. NVIDIA OptiX/Aftermath internal init paths historically pass NULL during fallback probes; without this guard libvgpu.so SegFaults inside Isaac Sim Kit init under LD_PRELOAD. Pattern matches commit 03f99d7 (cuMemGetInfo_v2).

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…rage

Same robustness pattern as Task 2 (cuMemAlloc_v2). cuMemAllocManaged now forwards NULL dptr to the real driver to surface CUDA_ERROR_INVALID_VALUE, instead of running oom_check first which could mask the real driver error as CUDA_ERROR_OUT_OF_MEMORY when oom_check trips before the driver gets called.

cuMemAllocHost_v2 was verified safe by baseline test (forward-first pattern already returns the driver error for NULL hptr without crashing); no source change there. Test file extended with NULL-pointer cases for both functions to lock in the contract.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Forwards NULL dptr or NULL pPitch to the real CUDA driver before HAMi's oom_check can mask the driver's CUDA_ERROR_INVALID_VALUE as HAMi's CUDA_ERROR_OUT_OF_MEMORY. Pattern matches cuMemAlloc_v2 (88143ab) and cuMemAllocManaged (275ba3d).

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…cleanup

Step B Tasks 5-7 bundle:
- cuMemHostAlloc: tests only (already forward-first, no semantic change needed)
- cuMemHostRegister_v2: drop vestigial cuCtxGetDevice(&dev) (result ignored, dev unused) + add explicit NULL hptr guard for forward-first consistency
- cuCtxGetDevice: tests only (pure passthrough)

Pattern matches cuMemAlloc_v2 (commit 88143ab), cuMemAllocManaged (275ba3d),
cuMemAllocPitch_v2 (01a58f1).

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…ch table

Foundation for Step C compat hardening:

* dispatch.{h,c}: add EnumerateDeviceExtensionProperties +
  EnumerateDeviceLayerProperties function pointers to the per-instance
  dispatch struct; resolve both during hami_instance_register so the
  layer's own Enumerate* hooks can forward correctly. Add
  hami_instance_first() helper that returns the first registered
  instance dispatch under lock — used by NULL-instance Enumerate
  forwarding when the loader probes before any instance has been
  created.
* layer.c: cache the first next-layer GetInstanceProcAddr /
  GetDeviceProcAddr in static globals during CreateInstance /
  CreateDevice. Expands comments documenting the Vulkan 1.3 §38.3.1
  contract for own-name vs NULL pLayerName Enumerate semantics, and
  why an earlier draft returning LAYER_NOT_PRESENT broke
  vkCreateDevice.

This commit only restructures the existing Enumerate hooks; it does not
yet change GIPA/GDPA fallback behavior (Task 2).

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…nknown

NVIDIA driver and Carbonite probe through our GIPA/GDPA with handles
that may not yet be registered: during vkCreateInstance before our
register completes, or with upper-layer-wrapped handles. Returning
NULL there crashed the caller (SegFault inside libcarb.graphics-vulkan
when assembling the dispatch table).

Now we forward to the first-cached next_gipa/next_gdpa from a previous
CreateInstance/CreateDevice. Only when both per-handle lookup AND the
cache are absent do we return NULL — that's the legitimate
pre-CreateInstance loader bootstrap window where Enumerate* hooks have
already been matched at the top of the function.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Step C Task 5 audit. Read-only review concludes: no code change needed.

Lifetime: lookup-then-use across dropped lock relies on Vulkan 1.3 §3.6
externally-synchronized-parameters contract. VkInstance/VkDevice destroy
must be application-serialized — Carbonite and Isaac Sim Kit comply.
Same pattern as VK_LAYER_KHRONOS_validation and nvidia layers (extending
the lock across unbounded next-chain calls would deadlock).

Chain: in-place advance of chain->u.pLayerInfo->pNext is the canonical
Khronos vulkan-loader recommendation (LoaderLayerInterface.md). Loader
allocates fresh VkLayerInstanceCreateInfo per CreateInstance call; reuse
is structurally impossible. Reference layers all do in-place advance.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
… path

Step C Task 3 trace + comparative testing.

Evidence: pre-Step-C 7dcb5a4 alive under LD_PRELOAD forced. Post-Step-C
eea2beb crashes (exit=139) at NVIDIA ICD init in libGLX_nvidia.so.0
vk_icdNegotiateLoaderICDInterfaceVersion -> libEGL_nvidia.so.0 __egl_Main
-> __sigaction. HAMI_VK_TRACE=0 lines (crash before our wrappers run).

Hypothesis: Task 1 HAMI_HOOK(EnumerateDeviceExtensionProperties) +
HAMI_HOOK(EnumerateDeviceLayerProperties) intercept ICD-side global
GIPA lookup under LD_PRELOAD-only path (no manifest activation), and
return 0 entries when g_inst_head == NULL. NVIDIA driver expects the
GIPA chain to fall through to the ICD instead.

Production .so on ws-node074 restored to pre-Step-C backup
(md5 8f889313). isaac-launchable-0 confirmed alive after restore.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…unknown

Run 5 attempted to gate HAMI_HOOK(EnumerateDevice*) on g_inst_head !=
NULL. Same crash, same backtrace, same HAMI_VK_TRACE=0 lines. Hypothesis
that Step C Task 1's HAMI_HOOK additions hijack NVIDIA ICD's GIPA
lookup is FALSIFIED — our wrapper is never called yet the crash still
happens.

Differential surface narrows to .so-load-time effects (exports, static
init) rather than Vulkan wrapper logic. Further bisect blocked by
sandbox: clean rebuild of 7dcb5a4 to compare md5 against 8f889313
(production backup) was denied.

Production .so restored to 8f889313 again. isaac-launchable-0 alive
verified post-restore.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…device unknown"

This reverts commit eea2beb.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…a dispatch table"

This reverts commit 996cb22.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Five thin wrappers around the HAMi-core symbols that libvgpu_vk.so
will need after the upcoming Vulkan-layer split: oom_check,
add/rm_gpu_device_memory_usage, get_current_device_memory_limit,
rate_limiter.

All five carry __attribute__((visibility("default"))) so that the
release build (-fvisibility=hidden) keeps the export surface narrow:
libvgpu_vk.so DT_NEEDED-resolves only these names and nothing else from
HAMi-core internals. No call-site changes yet — that follows in the next
commit.
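
A minimal sketch of the export pattern described above, assuming GCC/Clang visibility attributes. `HAMI_CORE_EXPORT`, `HAMI_INTERNAL`, `hami_core_get_memory_limit`, and the `int dev` / `uint64_t` signature are hypothetical; the commit text names only the wrapped symbols, not their signatures, and the 4000 MiB value is a placeholder.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical macro names. "default" keeps a symbol in the dynamic
 * symbol table even when the library is built with -fvisibility=hidden;
 * "hidden" emulates what that flag does to everything else. */
#define HAMI_CORE_EXPORT __attribute__((visibility("default")))
#define HAMI_INTERNAL    __attribute__((visibility("hidden")))

/* Stand-in for a HAMi-core internal. The real signature is not shown
 * in this PR text, so this one is assumed for illustration. */
HAMI_INTERNAL uint64_t get_current_device_memory_limit(int dev) {
    static const uint64_t limits[] = { 4000ull << 20 };  /* 4000 MiB */
    return dev == 0 ? limits[0] : 0;
}

/* Thin exported wrapper: the only name a dependent .so has to resolve. */
HAMI_CORE_EXPORT uint64_t hami_core_get_memory_limit(int dev) {
    return get_current_device_memory_limit(dev);
}

/* Usage: callers go through the wrapper, never the internal name. */
void hami_core_export_demo(void) {
    assert(hami_core_get_memory_limit(0) == (4000ull << 20));
    assert(hami_core_get_memory_limit(7) == 0);
}
```

With this shape, renaming or re-signaturing the internal function does not change what a dependent library links against.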

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Replace the extern declarations of oom_check, add_gpu_device_memory_usage,
rm_gpu_device_memory_usage, get_current_device_memory_limit, and
rate_limiter in src/vulkan/budget.c and src/vulkan/throttle_adapter.c
with calls through the new include/hami_core_export.h interface.

This is a pure call-site rewrite — same runtime behavior, same .so
boundary (still linked into one libvgpu.so for now). The point is to
remove direct dependence on HAMi-core internal symbol names so the
upcoming libvgpu_vk.so split can keep DT_NEEDED narrow.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
libvgpu.so loses vulkan_mod and now contains only HAMi-core
(NVML/CUDA hooks + allocator + multiprocess). libvgpu_vk.so is a new
shared target that holds all of src/vulkan/* and links libvgpu.so as
DT_NEEDED so the hami_core_* wrappers resolve when the Vulkan loader
dlopen()s the new .so via the implicit-layer manifest.

After this commit:
* nm -D libvgpu.so MUST NOT show vk*
* nm -D libvgpu_vk.so MUST show vkGetInstanceProcAddr,
  vkGetDeviceProcAddr, vkNegotiateLoaderLayerInterfaceVersion (and only
  those as exports thanks to -fvisibility=hidden + HAMI_LAYER_EXPORT).
* readelf -d libvgpu_vk.so MUST list libvgpu.so as NEEDED.

Step C plan: docs/superpowers/plans/2026-04-29-step-c-vk-so-split.md
Spec: docs/superpowers/specs/2026-04-29-step-c-redesign-vk-so-split.md

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
Static manifest that the Step D webhook + DaemonSet will install
into /etc/vulkan/implicit_layer.d/ to activate libvgpu_vk.so via the
Vulkan loader. file_format_version 1.0.0, type INSTANCE, api 1.3.0.

library_path is the production install path /usr/local/vgpu/libvgpu_vk.so;
no extensions claimed (the layer only intercepts existing entry points).

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…ch table

Foundation for Step C compat hardening:

* dispatch.{h,c}: add EnumerateDeviceExtensionProperties +
  EnumerateDeviceLayerProperties function pointers to the per-instance
  dispatch struct; resolve both during hami_instance_register so the
  layer's own Enumerate* hooks can forward correctly. Add
  hami_instance_first() helper that returns the first registered
  instance dispatch under lock — used by NULL-instance Enumerate
  forwarding when the loader probes before any instance has been
  created.
* layer.c: cache the first next-layer GetInstanceProcAddr /
  GetDeviceProcAddr in static globals during CreateInstance /
  CreateDevice. Expand the comments documenting the Vulkan 1.3 §38.3.1
  contract for own-name vs NULL pLayerName Enumerate semantics, and
  why an earlier draft that returned LAYER_NOT_PRESENT broke
  vkCreateDevice.

This commit only restructures the existing Enumerate hooks; it does not
yet change GIPA/GDPA fallback behavior (Task 2).
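
The registry helper described in the dispatch.{h,c} bullet can be sketched roughly as follows. The names `hami_instance_register`, `hami_instance_first`, and `g_inst_head` come from the commit messages; the struct layout, the signatures, the tail pointer, and the use of a pthread mutex are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-instance dispatch record. The real struct holds
 * Vulkan function pointers such as EnumerateDeviceExtensionProperties. */
typedef struct hami_instance_dispatch {
    void *instance;                         /* opaque handle used as key */
    struct hami_instance_dispatch *next;
} hami_instance_dispatch;

static hami_instance_dispatch *g_inst_head = NULL;
static hami_instance_dispatch *g_inst_tail = NULL;
static pthread_mutex_t g_inst_lock = PTHREAD_MUTEX_INITIALIZER;

/* Append, so the list keeps registration order. */
void hami_instance_register(hami_instance_dispatch *d) {
    pthread_mutex_lock(&g_inst_lock);
    d->next = NULL;
    if (g_inst_tail) g_inst_tail->next = d;
    else             g_inst_head = d;
    g_inst_tail = d;
    pthread_mutex_unlock(&g_inst_lock);
}

/* First registered dispatch, or NULL when nothing is registered yet. */
hami_instance_dispatch *hami_instance_first(void) {
    pthread_mutex_lock(&g_inst_lock);
    hami_instance_dispatch *d = g_inst_head;
    pthread_mutex_unlock(&g_inst_lock);
    return d;
}

/* Usage: a NULL-instance probe can forward through the first entry. */
void hami_instance_demo(void) {
    static hami_instance_dispatch a, b;
    assert(hami_instance_first() == NULL);
    hami_instance_register(&a);
    hami_instance_register(&b);
    assert(hami_instance_first() == &a);
}
```

The lock here only protects the list head; a caller that dereferences the returned pointer still has to ensure the instance is not destroyed concurrently.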

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
…nknown

The NVIDIA driver and Carbonite probe through our GIPA/GDPA with handles
that may not yet be registered: during vkCreateInstance before our
registration completes, or with upper-layer-wrapped handles. Returning
NULL there crashed the caller (segfault inside libcarb.graphics-vulkan
when assembling the dispatch table).

Now we forward to the first-cached next_gipa/next_gdpa from a previous
CreateInstance/CreateDevice. Only when both per-handle lookup AND the
cache are absent do we return NULL — that's the legitimate
pre-CreateInstance loader bootstrap window where Enumerate* hooks have
already been matched at the top of the function.
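
The three-step fallback above can be sketched without the Vulkan headers. Everything here is a simplified stand-in: `gipa_fn` approximates PFN_vkGetInstanceProcAddr, and `hami_resolve_fallthrough`, `lookup_next_gipa`, and the single-slot "registry" are hypothetical names, not the layer's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*void_fn)(void);
/* Simplified stand-in for PFN_vkGetInstanceProcAddr. */
typedef void_fn (*gipa_fn)(void *instance, const char *name);

/* Hypothetical state: one registered instance plus the first-cached
 * next-layer GetInstanceProcAddr. */
static void   *g_registered_instance = NULL;
static gipa_fn g_registered_next     = NULL;
static gipa_fn g_first_next_gipa     = NULL;

static gipa_fn lookup_next_gipa(void *instance) {
    return (instance && instance == g_registered_instance)
               ? g_registered_next : NULL;
}

void_fn hami_resolve_fallthrough(void *instance, const char *name) {
    gipa_fn next = lookup_next_gipa(instance);  /* 1. per-handle lookup  */
    if (!next) next = g_first_next_gipa;        /* 2. first-cached chain */
    if (!next) return NULL;                     /* 3. bootstrap window   */
    return next(instance, name);
}

/* Fake next layer used only by the demo below. */
static void stub_alloc(void) {}
static void_fn fake_next_gipa(void *instance, const char *name) {
    (void)instance;
    return strcmp(name, "vkAllocateMemory") == 0 ? stub_alloc : NULL;
}

void hami_fallthrough_demo(void) {
    int unregistered_handle;
    /* Nothing cached yet: NULL is the only possible answer. */
    assert(hami_resolve_fallthrough(&unregistered_handle,
                                    "vkAllocateMemory") == NULL);
    g_first_next_gipa = fake_next_gipa;
    /* Unregistered handle now falls through to the cached chain. */
    assert(hami_resolve_fallthrough(&unregistered_handle,
                                    "vkAllocateMemory") == stub_alloc);
}
```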

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
cpplint flagged src/vulkan/layer.c:331 at 126 chars. Split the
HAMI_TRACE format string and arg onto two lines so the longest line is
~107 chars.

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>
@100milliongold
Author

/retest

(The Build libvgpu CI failure is a transient apt mirror network issue: archive.ubuntu.com returned "Connection failed" for libbsd0, which broke the cmake install step, so the build never reached compilation of the code itself. Please retest.)

@hami-robot
Contributor

hami-robot Bot commented May 4, 2026

@100milliongold: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

…te heapUsage on budget query

Two related fixes for VK_EXT_memory_budget on Carbonite/Kit and similar
engines that target a Vulkan 1.0 instance with the KHR-extension form of
properties2:

(1) Hook the KHR alias of vkGetPhysicalDeviceMemoryProperties2

    The HAMI_HOOK macro previously matched only the core (Vulkan 1.1+)
    name. Loaders queried our layer's GetInstanceProcAddr for the KHR
    alias `vkGetPhysicalDeviceMemoryProperties2KHR` and fell through to
    the next layer / ICD when we returned NULL. Result: clamp_heaps was
    never invoked on those calls, so engines saw the host GPU's full
    heap size (~45 GiB) instead of the partition limit. Add a
    HAMI_HOOK_KHR_ALIAS macro and wire it for the memory-properties2
    entry point so the same hami_vk* function services both names.

(2) Inflate heapUsage rather than clamping heapBudget

    Engines that consume VK_EXT_memory_budget render an on-screen
    "X used / Y available" overlay computed as heapBudget - heapUsage.
    Clamping heapBudget to the partition limit produces the right
    "available" value but causes omni.physx.tensors to deadlock during
    plugin initialization (Isaac Sim Streaming 6.0, NVIDIA 580.142).
    The deadlock persists regardless of whether heapBudget is clamped
    to partition_limit, partition_limit-heapUsage, or any value below
    heap.size — PhysX/Carbonite consumes the absolute heapBudget
    through paths beyond the simple subtraction.

    Workaround: leave heapBudget at the ICD-reported value (host GPU's
    free memory) and inflate heapUsage by (icd_budget - partition_limit).
    The overlay's available calculation now equals
    heapBudget - (icd_usage + delta) = partition_limit - icd_usage,
    matching the partition. PhysX is unaffected because it does not
    consult heapUsage.
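
For fix (1), the alias matching can be sketched as below. `HAMI_HOOK` and `HAMI_HOOK_KHR_ALIAS` are the macro names from the commit message, but their bodies here are assumptions, as are `hami_lookup_hook` and the simplified `void_fn` type; the wrapper body is left empty.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*void_fn)(void);

/* Stand-in for the layer's wrapper; the real one clamps heap sizes. */
static void hami_vkGetPhysicalDeviceMemoryProperties2(void) {}

/* Match the core name "vk<fn>". */
#define HAMI_HOOK(fn) \
    if (strcmp(name, "vk" #fn) == 0) return (void_fn)hami_vk##fn

/* Match "vk<fn>KHR" but return the same wrapper as the core name. */
#define HAMI_HOOK_KHR_ALIAS(fn) \
    if (strcmp(name, "vk" #fn "KHR") == 0) return (void_fn)hami_vk##fn

/* Returns our wrapper, or NULL so the caller forwards down the chain. */
void_fn hami_lookup_hook(const char *name) {
    HAMI_HOOK(GetPhysicalDeviceMemoryProperties2);
    HAMI_HOOK_KHR_ALIAS(GetPhysicalDeviceMemoryProperties2);
    return NULL;
}

void hami_alias_demo(void) {
    void_fn core = hami_lookup_hook("vkGetPhysicalDeviceMemoryProperties2");
    void_fn khr  = hami_lookup_hook("vkGetPhysicalDeviceMemoryProperties2KHR");
    assert(core != NULL && core == khr);  /* one function, two names */
}
```

Sharing one wrapper works here because the KHR entry point and its promoted core version take the same parameters.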

Verified end-to-end on isaac-launchable-0 (23 GiB partition,
ws-node074):

  Pre-fix:  size=44.99 GiB heapBudget=32.12 GiB
            overlay "1.3 used / 32.2 available"     [false]
  Post-fix: size=23.00 GiB heapBudget=33.4  GiB heapUsage=11.5 GiB
            overlay "11.5 used / 21.9 available"    [21.9 = 23 - 1.1]
            omni.physx.tensors initializes cleanly; Kit reaches the
            streaming-ready idle state at 60 fps.
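
The post-fix figures are consistent with the inflation formula from fix (2), as this small sketch shows. `hami_inflate_heap_usage` is a hypothetical name, the guard for a budget already at or below the limit is an assumption, and the ICD-reported usage of 1.1 GiB is inferred from the "[21.9 = 23 - 1.1]" annotation. Values are in tenths of a GiB to keep the arithmetic exact.

```c
#include <assert.h>
#include <stdint.h>

/* Leave heapBudget untouched and raise heapUsage by the amount the ICD
 * budget exceeds the partition limit, so that
 *   heapBudget - heapUsage == partition_limit - icd_usage. */
uint64_t hami_inflate_heap_usage(uint64_t icd_budget, uint64_t icd_usage,
                                 uint64_t partition_limit) {
    if (icd_budget <= partition_limit)   /* assumed guard: nothing to hide */
        return icd_usage;
    return icd_usage + (icd_budget - partition_limit);
}

void hami_inflate_demo(void) {
    /* Units: 0.1 GiB. 33.4 GiB budget, 1.1 GiB usage, 23 GiB partition. */
    const uint64_t budget = 334, usage = 11, limit = 230;
    uint64_t reported = hami_inflate_heap_usage(budget, usage, limit);
    assert(reported == 115);            /* overlay: 11.5 used      */
    assert(budget - reported == 219);   /* overlay: 21.9 available */
    assert(budget - reported == limit - usage);
}
```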

Signed-off-by: Jea-Eok-Kim <je.kim@xiilab.com>