fix(cuda,allocator): initialize dev in 6 hooks reading uninitialised stack on cuCtxGetDevice failure by aamsellem · Pull Request #188 · Project-HAMi/HAMi-core

aamsellem · 2026-04-28T19:33:21Z

Fixes #187.

What

cuCtxGetDevice(&dev) is called in 6 different hooks across src/cuda/memory.c and src/allocator/allocator.c without checking the return value. Whenever cuCtxGetDevice fails (e.g. a CUDA call comes from a thread that hasn't pushed a context yet, which is common in async / multi-thread engines), dev is left uninitialised on the stack and forwarded to:

oom_check(dev, size) → reads garbage as device id, indexes into shared region
add_chunk_only(*handle, size, dev) → stores garbage as the entry's device
add_gpu_device_memory_usage(getpid(), dev, allocsize, 2) → indexes into shared region
rm_gpu_device_memory_usage(getpid(), dev, t_size, 2) → same
set_current_device_memory_limit(dev, newlimit) → logs Illegal device id: <random> and writes OOB into region_info.shared_region->limit[dev]

The visible symptom in pod logs is:

[HAMI-core ERROR ...]: Illegal device id: -644371744

with the random integer changing between runs. The silent symptom is shared-region corruption that can affect other processes attached to the same region_info.

This blocks any container that does multi-threaded / early CUDA allocation under HAMi — notably ggml-cuda (used by llama.cpp llama-server, the Lucebox / DFlash speculative-decoding stack), and any vLLM build with async scheduling.

How

Two commits.

1. cuMemCreate (src/cuda/memory.c) — initialise dev from prop->location.id up-front, and only call add_chunk_only when the allocation is actually DEVICE-pinned (do_oom_check == true). Non-DEVICE allocations don't consume per-device VRAM so they shouldn't count against per-device limits.

Also tightens set_current_device_memory_limit: it now returns -1 early instead of falling through to an OOB write when dev is out of range.

2. src/allocator/allocator.c — same pattern fix in 5 more sites:

Site	Behaviour
`oom_check` (dev == -1 branch)	bail out gracefully, return 0 (skip OOM accounting)
`add_chunk`	bail out, return CUDA_SUCCESS (alloc still tracked by the lib)
`remove_chunk`	skip per-device usage tracking, leave the lib free intact
`remove_chunk_async`	same
`add_chunk_async`	bail out, return CUDA_SUCCESS

Strategy is the same in all five: initialise dev = -1, check the return code of cuCtxGetDevice, and on failure skip per-device tracking gracefully. The underlying CUDA allocation/free still happens; we just don't bill it against a device limit if we don't know which device it belongs to. Safer than making up a device id and writing OOB into the shared region.

Also drops two redundant cuCtxGetDevice() calls in add_chunk and add_chunk_async — dev is already set at the top of each function, re-querying after the allocation just multiplies the chance of the same race.

Test

Reproducer in #187. Quick manual check:

// repro.c — should print 0 (CUDA_SUCCESS) and not log "Illegal device id"
#include <stdio.h>
#include <cuda.h>
int main(void) {
    cuInit(0);
    CUdevice device; cuDeviceGet(&device, 0);
    CUcontext ctx;   cuCtxCreate(&ctx, 0, device);
    CUmemAllocationProp prop = {0};
    prop.type             = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type    = CU_MEM_LOCATION_TYPE_HOST_NUMA;
    prop.location.id      = 0;
    size_t granularity;
    cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t size = ((1<<20) + granularity - 1) / granularity * granularity;
    CUmemGenericAllocationHandle handle;
    CUresult res = cuMemCreate(&handle, size, &prop, 0);
    printf("cuMemCreate returned %d\n", res);
    cuCtxDestroy(ctx);
    return 0;
}

Build: nvcc -lcuda repro.c -o repro. Run inside a HAMi-managed pod.

Before: [HAMI-core ERROR ...]: Illegal device id: <random> then cuMemCreate returned 0.
After: just cuMemCreate returned 0. No HAMi error, no OOB write.

End-to-end check: aamsellem/lucebox-qwen36-blackwell:1.0.0 pod boots through ggml_cuda_init and reaches llama-server is listening on a sm_120 GPU, instead of crash-looping at the first VMM allocation.

Notes

Behavioural change for cuMemCreate: non-DEVICE allocations are no longer added to the chunk tracker. They never consumed device VRAM in the first place, so this lines up with what HAMi actually bills against. If maintainers prefer to track them in a separate accounting path instead, happy to rework — just say the word.
Behavioural change for the allocator hooks: when cuCtxGetDevice fails, the per-device accounting is skipped, but the underlying CUDA call still runs. The shared region stays consistent rather than being written OOB. If maintainers prefer a different default (e.g. fall back to device 0), happy to switch.
The bounds-check fix in set_current_device_memory_limit is technically a separate concern. I kept it in the same PR because it's a 1-line addition and the symptom (Illegal device id log followed by silent shared-region corruption) was confusing me during the initial debug. Easy to split if preferred.
The same OOB-after-LOG_ERROR pattern still exists in the symmetric getters/setters (get_current_device_sm_limit, set_current_device_sm_limit_scale, get_current_device_memory_limit, get_current_device_memory_monitor, get_current_device_memory_usage). Those weren't on the crash path I was tracing so I left them alone — happy to do them in this PR or as a follow-up, just let me know.
Tested on HAMi 2.5.1 / HAMi-core master at commit b6afdd0, driver 580.65.06, CUDA 13.0.48, RTX 5090 Laptop GPU (sm_120). Should be hardware-agnostic.

Checklist

Tested on HAMi-managed pod with both VMM-enabled and VMM-disabled builds — VMM-enabled now boots, VMM-disabled unaffected.
Repro pre-fix and post-fix documented in Bug: cuMemCreate hook reads uninitialized dev when prop->location.type != CU_MEM_LOCATION_TYPE_DEVICE #187.
No new tests added — src/multiprocess/ and src/allocator/ don't currently ship with a test harness in this repo. Happy to add one if there's a pattern to follow.

…VICE allocations cuMemCreate left `dev` uninitialized when `prop->location.type != CU_MEM_LOCATION_TYPE_DEVICE`, then unconditionally forwarded the stack garbage to add_chunk_only -> set_current_device_memory_limit, which logged "Illegal device id: <random>" and wrote out of bounds into region_info.shared_region->limit[dev]. This blocked any container using CUDA Driver API VMM with non-DEVICE allocation locations - notably ggml-cuda (used by llama.cpp, llama-server, and downstream forks like Lucebox), which reserves a virtual address pool with CU_MEM_LOCATION_TYPE_HOST_NUMA before binding to a specific device. Two small changes: 1. cuMemCreate now initializes `dev` from `prop->location.id` up-front, and only calls add_chunk_only when the allocation is actually device-pinned (do_oom_check). Non-DEVICE allocations no longer count against per-device memory limits, since they don't consume device VRAM anyway. 2. set_current_device_memory_limit now returns -1 instead of falling through to an OOB write when dev is out of range. Fixes Project-HAMi#187 Signed-off-by: aamsellem <620182+aamsellem@users.noreply.github.com>

hami-robot · 2026-04-28T19:33:25Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aamsellem
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hami-robot · 2026-04-28T19:33:30Z

Welcome @aamsellem! It looks like this is your first PR to Project-HAMi/HAMi-core 🎉

Same uninitialised-dev pattern as the cuMemCreate fix in the previous commit, but spread across allocator.c. All five callers used to call cuCtxGetDevice(&dev) and ignore the return code, then forwarded the stack-garbage `dev` to oom_check / add_gpu_device_memory_usage / rm_gpu_device_memory_usage / get_current_device_memory_limit, which ends up either logging "Illegal device id: <random>" or making out-of-bounds reads/writes against region_info.shared_region. Sites fixed: - oom_check (dev == -1 branch) - add_chunk - remove_chunk - remove_chunk_async - add_chunk_async Strategy is the same in all five: initialise dev to -1, check the return code of cuCtxGetDevice, and on failure skip the per-device tracking gracefully (the underlying CUDA allocation still succeeded; we just can't bill it against a device limit if we don't know which device it belongs to). This is safer than making up a device id and writing OOB into the shared region. Also drops two redundant cuCtxGetDevice() calls in add_chunk and add_chunk_async — `dev` is already set at the top of each function, re-querying after the allocation just multiplies the chance of the same race we're trying to fix. This pattern is what was actually crashing inference engines that do multi-thread early-allocation under HAMi (llama.cpp / ggml-cuda / Lucebox / vLLM). Fixing only cuMemCreate (the previous commit) would have left the allocator hooks broken on the same path. Refs Project-HAMi#187 Signed-off-by: aamsellem <620182+aamsellem@users.noreply.github.com>

archlitchi · 2026-04-29T09:15:31Z

Hi, this illegal device ID issue should have fixed in v2.8.1, have you tried that?

aamsellem · 2026-04-29T09:44:33Z

Hi @archlitchi, thanks for the quick reply ! Just to make sure we're talking about the same thing — v2.8.1 refers to the Project-HAMi/HAMi wrapper release, right ? HAMi-core itself doesn't have releases, so I checked the master branch directly.

The bug is still present on Project-HAMi/HAMi-core master at the current HEAD (e3c3477). Lines 587-602 of src/cuda/memory.c :

CUresult cuMemCreate ( CUmemGenericAllocationHandle* handle, size_t size, const CUmemAllocationProp* prop, unsigned long long flags ) {
    LOG_INFO("cuMemCreate:%lld:%d", size, prop->location.id);
    ENSURE_RUNNING();
    CUdevice dev;                                          // ← still uninitialized
    int do_oom_check = (prop->location.type == CU_MEM_LOCATION_TYPE_DEVICE);
    if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
        dev = prop->location.id;
    }
    // ...
    if (res == CUDA_SUCCESS) {
        add_chunk_only(*handle, size, dev);                // ← uninit dev when do_oom_check == false
    }

When prop->location.type != CU_MEM_LOCATION_TYPE_DEVICE (e.g. ggml-cuda using CU_MEM_LOCATION_TYPE_HOST_NUMA for its virtual address pool), do_oom_check == false, so the if (...) block that initialises dev is skipped entirely. Then add_chunk_only(*handle, size, dev) is called with whatever was on the stack.

Same pattern in 5 more sites in src/allocator/allocator.c (lines 39, 106, 172, 229, 249, 276) — all calling cuCtxGetDevice(&dev) with the return code ignored. If cuCtxGetDevice fails (which happens in async / multi-thread / early-init paths), dev ends up forwarded to set_current_device_memory_limit → Illegal device id: <random> + OOB write into region_info.shared_region->limit[dev].

The minimal reproducer in #187 triggers the bug deterministically against current master. I built it against e3c3477 to double-check before responding here.

Could you point me to the specific commit / PR that you think fixes this on HAMi (the wrapper) ? If there's a related fix on the wrapper side I'm happy to align my approach, but the uninitialised stack read in HAMi-core looks distinct.

hami-robot Bot added the dco-signoff: yes label Apr 28, 2026

hami-robot Bot requested review from archlitchi and chaunceyjiang April 28, 2026 19:33

hami-robot Bot added the size/XS label Apr 28, 2026

hami-robot Bot added size/M and removed size/XS labels Apr 28, 2026

aamsellem changed the title ~~fix(cuda): initialize dev in cuMemCreate hook to avoid OOB on non-DEVICE allocations~~ fix(cuda,allocator): initialize dev in 6 hooks reading uninitialised stack on cuCtxGetDevice failure Apr 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cuda,allocator): initialize dev in 6 hooks reading uninitialised stack on cuCtxGetDevice failure#188

fix(cuda,allocator): initialize dev in 6 hooks reading uninitialised stack on cuCtxGetDevice failure#188
aamsellem wants to merge 2 commits into
Project-HAMi:mainfrom
aamsellem:fix/cumemcreate-uninit-dev

aamsellem commented Apr 28, 2026 •

edited

Loading

Uh oh!

hami-robot Bot commented Apr 28, 2026

Uh oh!

hami-robot Bot commented Apr 28, 2026

Uh oh!

archlitchi commented Apr 29, 2026

Uh oh!

aamsellem commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

aamsellem commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

How

Test

Notes

Checklist

Uh oh!

hami-robot Bot commented Apr 28, 2026

Uh oh!

hami-robot Bot commented Apr 28, 2026

Uh oh!

archlitchi commented Apr 29, 2026

Uh oh!

aamsellem commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

aamsellem commented Apr 28, 2026 •

edited

Loading