fix(memory_limit): treat "0m" as no-limit (consistent with SM_LIMIT=0)#190
fix(memory_limit): treat "0m" as no-limit (consistent with SM_LIMIT=0)#190aamsellem wants to merge 2 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: aamsellem The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
`get_limit_from_env()` parses the per-device memory cap from
`CUDA_DEVICE_MEMORY_LIMIT_<idx>`. The branch that handles a parsed
value of 0 already treats SM_LIMIT=0 as "no limit" with an INFO log,
but treated MEMORY_LIMIT=0 as invalid with a WARN log. The caller
(`do_init_device_memory_limits`) sets `arr[i] = 0` in both cases,
which downstream is interpreted as "no per-pod cap" — there is no
"reject" path. So the WARN was misleading, and crucially the path
that runs after the warning didn't always fall through to the NVML
query of actual device memory.
Concretely on the Olares (k0s + bytetrade device-plugin) deployment,
the plugin sets `CUDA_DEVICE_MEMORY_LIMIT_<idx>=0m` by default when
no `nvidia.com/gpumem` request is set on the pod. With the previous
"invalid" branch the host-side memory total was reported back as
0 MiB on the Driver API path (cuMemAlloc / cuMemoryAllocate),
producing immediate `res=2` (CUDA_ERROR_OUT_OF_MEMORY) on the first
allocation:
[HAMI-core Warn]: invalid device memory limit CUDA_DEVICE_MEMORY_LIMIT_0=0m
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 0 MiB):
[HAMI-core ERROR allocator.c:126]: cuMemoryAllocate failed res=2
The Runtime API path (cudaMalloc) was not impacted by the same code
path and continued to report the correct memory size; this is why
some workloads on the same pod template worked while others (the
ones that go through the Driver API VMM allocator) did not.
This change normalises the two branches: 0m / 0g / 0k / 0 are all
valid no-limit encodings for memory just as they are for SM, with an
INFO log instead of WARN. No semantic change for non-zero values; no
change to caller code.
Tested by:
- Repro on Olares One (RTX 5090 Laptop, sm_120) where a Lucebox DFlash
workload using cuMemoryAllocate failed with `Total VRAM: 0 MiB`
before this change. After the change the same workload sees the
full 24 GiB and runs to completion.
- Other pods on the same node (vLLM, llama.cpp standard) continue to
see the correct memory limit; no regression on the Runtime API path.
Signed-off-by: Aurélien Amsellem <hello@aamsellem.com>
a86230c to
a692501
Compare
| // "invalid" and the caller fell through to a 0-byte effective | ||
| // limit, which then failed cuMemAlloc / cuMemoryAllocate with | ||
| // res=2 on the Driver API path even though the device has | ||
| // memory available. |
There was a problem hiding this comment.
please remove the AI comments
archlitchi requested removing the AI-style explanatory comments on the 0m memory limit branch to match the existing terse style of the SM-limit branch above.
|
Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits. 📝 Please follow instructions in the contributing guide to update your commits with the DCO Full details of the Developer Certificate of Origin can be found at developercertificate.org. The list of commits missing DCO signoff:
DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
please sign-off your commit |
Problem
get_limit_from_env()inmultiprocess_memory_limit.cparsesCUDA_DEVICE_MEMORY_LIMIT_<idx>and returns the byte count. Thefallback path when the parsed value is 0 already handles SM_LIMIT
correctly:
For the memory branch the behaviour is asymmetric: it logs as
"invalid" and returns 0, but the caller
(
do_init_device_memory_limits) treats 0 the same way it treats amissing env var — no per-pod memory cap. There is no "reject" code
path that would trigger on this WARN. So the "invalid" log is at
best misleading, and at worst it gets read as a real configuration
error during triage.
In addition, on at least one downstream that uses HAMi-core
(Olares' device-plugin / bytetrade fork),
CUDA_DEVICE_MEMORY_LIMIT_<idx>is set to
\"0m\"by default when the pod doesn't requestnvidia.com/gpumem. With the current code that emits the WARN, andon the Driver API allocation path (
cuMemAlloc/cuMemoryAllocate),the host-side memory query then reports 0 MiB available, producing
an immediate
res=2(CUDA_ERROR_OUT_OF_MEMORY) on the firstallocation:
The Runtime API path (
cudaMalloc) does not hit the same branch andsees the full device memory, which is why mixed-workload pods on the
same template can have one container running fine and another OOMing
on first alloc.
Fix
Normalise the two branches: 0m / 0g / 0k / 0 are all valid no-limit
encodings for memory just as they are for SM, with an INFO log
instead of a WARN. The function still returns 0 — semantics for
non-zero values are unchanged — and the caller continues to interpret
0 as "no per-pod cap". No change in behaviour for any value other
than zero; the change is purely diagnostic + intent-clarifying.
Repro / verification
Hardware: NVIDIA RTX 5090 Laptop (sm_120) on Olares 1.12.x with the
bundled
hami-device-pluginDaemonSet.nvidia.com/gpu: 1(nogpumemset).CUDA_DEVICE_MEMORY_LIMIT_0=0m.Before this PR: a workload that takes the Driver API path
(e.g. Lucebox DFlash via
test_dflash, or any code callingcuMemoryAllocatedirectly) seesTotal VRAM: 0 MiBand OOMs onfirst alloc. Workaround in user-space is to set the env to a sane
value (e.g.
24000m) in the pod entrypoint.After this PR: the same workload sees the full device VRAM
reported by the underlying NVML query and allocates normally.
Runtime API workloads on the same node template (vLLM, standard
llama.cpp) are unaffected — they were already going through the
NVML fallback for unset limits.
Notes
do_init_device_memory_limitstocollapse the
cur_limit > 0 / fallback / elsechain since afterthis change the behaviour for both 0 and unset is identical, but
that is a larger refactor and feels out of scope.
env_name[12]=='M') is preserved asthe existing convention; happy to introduce a proper enum if the
maintainers prefer.
automated test for this code path exists in-tree so I haven't
added one.