Skip to content

fix(cuda,allocator): initialize dev in 6 hooks reading uninitialised stack on cuCtxGetDevice failure#188

Open
aamsellem wants to merge 2 commits into
Project-HAMi:mainfrom
aamsellem:fix/cumemcreate-uninit-dev
Open

fix(cuda,allocator): initialize dev in 6 hooks reading uninitialised stack on cuCtxGetDevice failure#188
aamsellem wants to merge 2 commits into
Project-HAMi:mainfrom
aamsellem:fix/cumemcreate-uninit-dev

Conversation

@aamsellem
Copy link
Copy Markdown

@aamsellem aamsellem commented Apr 28, 2026

Fixes #187.

What

cuCtxGetDevice(&dev) is called in 6 different hooks across src/cuda/memory.c and src/allocator/allocator.c without checking the return value. Whenever cuCtxGetDevice fails (e.g. a CUDA call comes from a thread that hasn't pushed a context yet, which is common in async / multi-thread engines), dev is left uninitialised on the stack and forwarded to:

  • oom_check(dev, size) → reads garbage as device id, indexes into shared region
  • add_chunk_only(*handle, size, dev) → stores garbage as the entry's device
  • add_gpu_device_memory_usage(getpid(), dev, allocsize, 2) → indexes into shared region
  • rm_gpu_device_memory_usage(getpid(), dev, t_size, 2) → same
  • set_current_device_memory_limit(dev, newlimit) → logs Illegal device id: <random> and writes OOB into region_info.shared_region->limit[dev]

The visible symptom in pod logs is:

[HAMI-core ERROR ...]: Illegal device id: -644371744

with the random integer changing between runs. The silent symptom is shared-region corruption that can affect other processes attached to the same region_info.

This blocks any container that does multi-threaded / early CUDA allocation under HAMi — notably ggml-cuda (used by llama.cpp llama-server, the Lucebox / DFlash speculative-decoding stack), and any vLLM build with async scheduling.

How

Two commits.

1. cuMemCreate (src/cuda/memory.c) — initialise dev from prop->location.id up-front, and only call add_chunk_only when the allocation is actually DEVICE-pinned (do_oom_check == true). Non-DEVICE allocations don't consume per-device VRAM so they shouldn't count against per-device limits.

Also tightens set_current_device_memory_limit: it now returns -1 early instead of falling through to an OOB write when dev is out of range.

2. src/allocator/allocator.c — same pattern fix in 5 more sites:

Site Behaviour
oom_check (dev == -1 branch) bail out gracefully, return 0 (skip OOM accounting)
add_chunk bail out, return CUDA_SUCCESS (alloc still tracked by the lib)
remove_chunk skip per-device usage tracking, leave the lib free intact
remove_chunk_async same
add_chunk_async bail out, return CUDA_SUCCESS

Strategy is the same in all five: initialise dev = -1, check the return code of cuCtxGetDevice, and on failure skip per-device tracking gracefully. The underlying CUDA allocation/free still happens; we just don't bill it against a device limit if we don't know which device it belongs to. Safer than making up a device id and writing OOB into the shared region.

Also drops two redundant cuCtxGetDevice() calls in add_chunk and add_chunk_asyncdev is already set at the top of each function, re-querying after the allocation just multiplies the chance of the same race.

Test

Reproducer in #187. Quick manual check:

// repro.c — should print 0 (CUDA_SUCCESS) and not log "Illegal device id"
#include <stdio.h>
#include <cuda.h>
int main(void) {
    cuInit(0);
    CUdevice device; cuDeviceGet(&device, 0);
    CUcontext ctx;   cuCtxCreate(&ctx, 0, device);
    CUmemAllocationProp prop = {0};
    prop.type             = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type    = CU_MEM_LOCATION_TYPE_HOST_NUMA;
    prop.location.id      = 0;
    size_t granularity;
    cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t size = ((1<<20) + granularity - 1) / granularity * granularity;
    CUmemGenericAllocationHandle handle;
    CUresult res = cuMemCreate(&handle, size, &prop, 0);
    printf("cuMemCreate returned %d\n", res);
    cuCtxDestroy(ctx);
    return 0;
}

Build: nvcc -lcuda repro.c -o repro. Run inside a HAMi-managed pod.

  • Before: [HAMI-core ERROR ...]: Illegal device id: <random> then cuMemCreate returned 0.
  • After: just cuMemCreate returned 0. No HAMi error, no OOB write.

End-to-end check: aamsellem/lucebox-qwen36-blackwell:1.0.0 pod boots through ggml_cuda_init and reaches llama-server is listening on a sm_120 GPU, instead of crash-looping at the first VMM allocation.

Notes

  • Behavioural change for cuMemCreate: non-DEVICE allocations are no longer added to the chunk tracker. They never consumed device VRAM in the first place, so this lines up with what HAMi actually bills against. If maintainers prefer to track them in a separate accounting path instead, happy to rework — just say the word.
  • Behavioural change for the allocator hooks: when cuCtxGetDevice fails, the per-device accounting is skipped, but the underlying CUDA call still runs. The shared region stays consistent rather than being written OOB. If maintainers prefer a different default (e.g. fall back to device 0), happy to switch.
  • The bounds-check fix in set_current_device_memory_limit is technically a separate concern. I kept it in the same PR because it's a 1-line addition and the symptom (Illegal device id log followed by silent shared-region corruption) was confusing me during the initial debug. Easy to split if preferred.
  • The same OOB-after-LOG_ERROR pattern still exists in the symmetric getters/setters (get_current_device_sm_limit, set_current_device_sm_limit_scale, get_current_device_memory_limit, get_current_device_memory_monitor, get_current_device_memory_usage). Those weren't on the crash path I was tracing so I left them alone — happy to do them in this PR or as a follow-up, just let me know.
  • Tested on HAMi 2.5.1 / HAMi-core master at commit b6afdd0, driver 580.65.06, CUDA 13.0.48, RTX 5090 Laptop GPU (sm_120). Should be hardware-agnostic.

Checklist

…VICE allocations

cuMemCreate left `dev` uninitialized when `prop->location.type !=
CU_MEM_LOCATION_TYPE_DEVICE`, then unconditionally forwarded the stack
garbage to add_chunk_only -> set_current_device_memory_limit, which
logged "Illegal device id: <random>" and wrote out of bounds into
region_info.shared_region->limit[dev].

This blocked any container using CUDA Driver API VMM with non-DEVICE
allocation locations - notably ggml-cuda (used by llama.cpp,
llama-server, and downstream forks like Lucebox), which reserves a
virtual address pool with CU_MEM_LOCATION_TYPE_HOST_NUMA before
binding to a specific device.

Two small changes:

1. cuMemCreate now initializes `dev` from `prop->location.id` up-front,
   and only calls add_chunk_only when the allocation is actually
   device-pinned (do_oom_check). Non-DEVICE allocations no longer count
   against per-device memory limits, since they don't consume device
   VRAM anyway.

2. set_current_device_memory_limit now returns -1 instead of falling
   through to an OOB write when dev is out of range.

Fixes Project-HAMi#187

Signed-off-by: aamsellem <620182+aamsellem@users.noreply.github.com>
@hami-robot
Copy link
Copy Markdown
Contributor

hami-robot Bot commented Apr 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aamsellem
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot
Copy link
Copy Markdown
Contributor

hami-robot Bot commented Apr 28, 2026

Welcome @aamsellem! It looks like this is your first PR to Project-HAMi/HAMi-core 🎉

@hami-robot hami-robot Bot added the size/XS label Apr 28, 2026
Same uninitialised-dev pattern as the cuMemCreate fix in the previous
commit, but spread across allocator.c. All five callers used to call
cuCtxGetDevice(&dev) and ignore the return code, then forwarded the
stack-garbage `dev` to oom_check / add_gpu_device_memory_usage /
rm_gpu_device_memory_usage / get_current_device_memory_limit, which
ends up either logging "Illegal device id: <random>" or making
out-of-bounds reads/writes against region_info.shared_region.

Sites fixed:

- oom_check (dev == -1 branch)
- add_chunk
- remove_chunk
- remove_chunk_async
- add_chunk_async

Strategy is the same in all five: initialise dev to -1, check the
return code of cuCtxGetDevice, and on failure skip the per-device
tracking gracefully (the underlying CUDA allocation still succeeded;
we just can't bill it against a device limit if we don't know which
device it belongs to). This is safer than making up a device id and
writing OOB into the shared region.

Also drops two redundant cuCtxGetDevice() calls in add_chunk and
add_chunk_async — `dev` is already set at the top of each function,
re-querying after the allocation just multiplies the chance of the
same race we're trying to fix.

This pattern is what was actually crashing inference engines that do
multi-thread early-allocation under HAMi (llama.cpp / ggml-cuda /
Lucebox / vLLM). Fixing only cuMemCreate (the previous commit) would
have left the allocator hooks broken on the same path.

Refs Project-HAMi#187

Signed-off-by: aamsellem <620182+aamsellem@users.noreply.github.com>
@hami-robot hami-robot Bot added size/M and removed size/XS labels Apr 28, 2026
@aamsellem aamsellem changed the title fix(cuda): initialize dev in cuMemCreate hook to avoid OOB on non-DEVICE allocations fix(cuda,allocator): initialize dev in 6 hooks reading uninitialised stack on cuCtxGetDevice failure Apr 28, 2026
@archlitchi
Copy link
Copy Markdown
Member

Hi, this illegal device ID issue should have fixed in v2.8.1, have you tried that?

@aamsellem
Copy link
Copy Markdown
Author

Hi @archlitchi, thanks for the quick reply ! Just to make sure we're talking about the same thing — v2.8.1 refers to the Project-HAMi/HAMi wrapper release, right ? HAMi-core itself doesn't have releases, so I checked the master branch directly.

The bug is still present on Project-HAMi/HAMi-core master at the current HEAD (e3c3477). Lines 587-602 of src/cuda/memory.c :

CUresult cuMemCreate ( CUmemGenericAllocationHandle* handle, size_t size, const CUmemAllocationProp* prop, unsigned long long flags ) {
    LOG_INFO("cuMemCreate:%lld:%d", size, prop->location.id);
    ENSURE_RUNNING();
    CUdevice dev;                                          // ← still uninitialized
    int do_oom_check = (prop->location.type == CU_MEM_LOCATION_TYPE_DEVICE);
    if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
        dev = prop->location.id;
    }
    // ...
    if (res == CUDA_SUCCESS) {
        add_chunk_only(*handle, size, dev);                // ← uninit dev when do_oom_check == false
    }

When prop->location.type != CU_MEM_LOCATION_TYPE_DEVICE (e.g. ggml-cuda using CU_MEM_LOCATION_TYPE_HOST_NUMA for its virtual address pool), do_oom_check == false, so the if (...) block that initialises dev is skipped entirely. Then add_chunk_only(*handle, size, dev) is called with whatever was on the stack.

Same pattern in 5 more sites in src/allocator/allocator.c (lines 39, 106, 172, 229, 249, 276) — all calling cuCtxGetDevice(&dev) with the return code ignored. If cuCtxGetDevice fails (which happens in async / multi-thread / early-init paths), dev ends up forwarded to set_current_device_memory_limitIllegal device id: <random> + OOB write into region_info.shared_region->limit[dev].

The minimal reproducer in #187 triggers the bug deterministically against current master. I built it against e3c3477 to double-check before responding here.

Could you point me to the specific commit / PR that you think fixes this on HAMi (the wrapper) ? If there's a related fix on the wrapper side I'm happy to align my approach, but the uninitialised stack read in HAMi-core looks distinct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: cuMemCreate hook reads uninitialized dev when prop->location.type != CU_MEM_LOCATION_TYPE_DEVICE

2 participants