fix(cuda,allocator): initialize dev in 6 hooks reading uninitialised stack on cuCtxGetDevice failure#188
Conversation
…VICE allocations cuMemCreate left `dev` uninitialized when `prop->location.type != CU_MEM_LOCATION_TYPE_DEVICE`, then unconditionally forwarded the stack garbage to add_chunk_only -> set_current_device_memory_limit, which logged "Illegal device id: <random>" and wrote out of bounds into region_info.shared_region->limit[dev]. This blocked any container using CUDA Driver API VMM with non-DEVICE allocation locations - notably ggml-cuda (used by llama.cpp, llama-server, and downstream forks like Lucebox), which reserves a virtual address pool with CU_MEM_LOCATION_TYPE_HOST_NUMA before binding to a specific device. Two small changes: 1. cuMemCreate now initializes `dev` from `prop->location.id` up-front, and only calls add_chunk_only when the allocation is actually device-pinned (do_oom_check). Non-DEVICE allocations no longer count against per-device memory limits, since they don't consume device VRAM anyway. 2. set_current_device_memory_limit now returns -1 instead of falling through to an OOB write when dev is out of range. Fixes Project-HAMi#187 Signed-off-by: aamsellem <620182+aamsellem@users.noreply.github.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: aamsellem The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Welcome @aamsellem! It looks like this is your first PR to Project-HAMi/HAMi-core 🎉 |
Same uninitialised-dev pattern as the cuMemCreate fix in the previous commit, but spread across allocator.c. All five callers used to call cuCtxGetDevice(&dev) and ignore the return code, then forwarded the stack-garbage `dev` to oom_check / add_gpu_device_memory_usage / rm_gpu_device_memory_usage / get_current_device_memory_limit, which ends up either logging "Illegal device id: <random>" or making out-of-bounds reads/writes against region_info.shared_region. Sites fixed: - oom_check (dev == -1 branch) - add_chunk - remove_chunk - remove_chunk_async - add_chunk_async Strategy is the same in all five: initialise dev to -1, check the return code of cuCtxGetDevice, and on failure skip the per-device tracking gracefully (the underlying CUDA allocation still succeeded; we just can't bill it against a device limit if we don't know which device it belongs to). This is safer than making up a device id and writing OOB into the shared region. Also drops two redundant cuCtxGetDevice() calls in add_chunk and add_chunk_async — `dev` is already set at the top of each function, re-querying after the allocation just multiplies the chance of the same race we're trying to fix. This pattern is what was actually crashing inference engines that do multi-thread early-allocation under HAMi (llama.cpp / ggml-cuda / Lucebox / vLLM). Fixing only cuMemCreate (the previous commit) would have left the allocator hooks broken on the same path. Refs Project-HAMi#187 Signed-off-by: aamsellem <620182+aamsellem@users.noreply.github.com>
|
Hi, this illegal device ID issue should have fixed in v2.8.1, have you tried that? |
|
Hi @archlitchi, thanks for the quick reply ! Just to make sure we're talking about the same thing — The bug is still present on CUresult cuMemCreate ( CUmemGenericAllocationHandle* handle, size_t size, const CUmemAllocationProp* prop, unsigned long long flags ) {
LOG_INFO("cuMemCreate:%lld:%d", size, prop->location.id);
ENSURE_RUNNING();
CUdevice dev; // ← still uninitialized
int do_oom_check = (prop->location.type == CU_MEM_LOCATION_TYPE_DEVICE);
if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
dev = prop->location.id;
}
// ...
if (res == CUDA_SUCCESS) {
add_chunk_only(*handle, size, dev); // ← uninit dev when do_oom_check == false
}When Same pattern in 5 more sites in The minimal reproducer in #187 triggers the bug deterministically against current master. I built it against Could you point me to the specific commit / PR that you think fixes this on HAMi (the wrapper) ? If there's a related fix on the wrapper side I'm happy to align my approach, but the uninitialised stack read in HAMi-core looks distinct. |
Fixes #187.
What
cuCtxGetDevice(&dev)is called in 6 different hooks acrosssrc/cuda/memory.candsrc/allocator/allocator.cwithout checking the return value. WhenevercuCtxGetDevicefails (e.g. a CUDA call comes from a thread that hasn't pushed a context yet, which is common in async / multi-thread engines),devis left uninitialised on the stack and forwarded to:oom_check(dev, size)→ reads garbage as device id, indexes into shared regionadd_chunk_only(*handle, size, dev)→ stores garbage as the entry's deviceadd_gpu_device_memory_usage(getpid(), dev, allocsize, 2)→ indexes into shared regionrm_gpu_device_memory_usage(getpid(), dev, t_size, 2)→ sameset_current_device_memory_limit(dev, newlimit)→ logsIllegal device id: <random>and writes OOB intoregion_info.shared_region->limit[dev]The visible symptom in pod logs is:
with the random integer changing between runs. The silent symptom is shared-region corruption that can affect other processes attached to the same
region_info.This blocks any container that does multi-threaded / early CUDA allocation under HAMi — notably
ggml-cuda(used byllama.cppllama-server, the Lucebox / DFlash speculative-decoding stack), and any vLLM build with async scheduling.How
Two commits.
1.
cuMemCreate(src/cuda/memory.c) — initialisedevfromprop->location.idup-front, and only calladd_chunk_onlywhen the allocation is actually DEVICE-pinned (do_oom_check == true). Non-DEVICE allocations don't consume per-device VRAM so they shouldn't count against per-device limits.Also tightens
set_current_device_memory_limit: it now returns-1early instead of falling through to an OOB write whendevis out of range.2.
src/allocator/allocator.c— same pattern fix in 5 more sites:oom_check(dev == -1 branch)add_chunkremove_chunkremove_chunk_asyncadd_chunk_asyncStrategy is the same in all five: initialise
dev = -1, check the return code ofcuCtxGetDevice, and on failure skip per-device tracking gracefully. The underlying CUDA allocation/free still happens; we just don't bill it against a device limit if we don't know which device it belongs to. Safer than making up a device id and writing OOB into the shared region.Also drops two redundant
cuCtxGetDevice()calls inadd_chunkandadd_chunk_async—devis already set at the top of each function, re-querying after the allocation just multiplies the chance of the same race.Test
Reproducer in #187. Quick manual check:
Build:
nvcc -lcuda repro.c -o repro. Run inside a HAMi-managed pod.[HAMI-core ERROR ...]: Illegal device id: <random>thencuMemCreate returned 0.cuMemCreate returned 0. No HAMi error, no OOB write.End-to-end check:
aamsellem/lucebox-qwen36-blackwell:1.0.0pod boots throughggml_cuda_initand reachesllama-server is listeningon a sm_120 GPU, instead of crash-looping at the first VMM allocation.Notes
cuMemCreate: non-DEVICE allocations are no longer added to the chunk tracker. They never consumed device VRAM in the first place, so this lines up with what HAMi actually bills against. If maintainers prefer to track them in a separate accounting path instead, happy to rework — just say the word.cuCtxGetDevicefails, the per-device accounting is skipped, but the underlying CUDA call still runs. The shared region stays consistent rather than being written OOB. If maintainers prefer a different default (e.g. fall back to device 0), happy to switch.set_current_device_memory_limitis technically a separate concern. I kept it in the same PR because it's a 1-line addition and the symptom (Illegal device idlog followed by silent shared-region corruption) was confusing me during the initial debug. Easy to split if preferred.get_current_device_sm_limit,set_current_device_sm_limit_scale,get_current_device_memory_limit,get_current_device_memory_monitor,get_current_device_memory_usage). Those weren't on the crash path I was tracing so I left them alone — happy to do them in this PR or as a follow-up, just let me know.b6afdd0, driver 580.65.06, CUDA 13.0.48, RTX 5090 Laptop GPU (sm_120). Should be hardware-agnostic.Checklist
src/multiprocess/andsrc/allocator/don't currently ship with a test harness in this repo. Happy to add one if there's a pattern to follow.