fix(nvml): allow retrying nvmlInit to prevent permanent failure state#168
fix(nvml): allow retrying nvmlInit to prevent permanent failure state#168maishivamhoo123 wants to merge 6 commits into
Conversation
|
Welcome @maishivamhoo123! It looks like this is your first PR to Project-HAMi/HAMi-core 🎉 |
|
@DSFans2014 , @archlitchi @Shouren and @team can you please review these PR? |
|
i got the issue, but i wonder why replace pthread_once, we can simply check NVML_SUCCESS before calling that |
Thank you @archlitchi for your review , yes you are right we should simply check NVML_SUCCESS before calling that , When I initially debugged this, I saw that pthread_once was rigidly locking our state on failure. Because pthread_once doesn't have a reset mechanism, my immediate thought was that the tool itself was too inflexible for a retry loop. Therefore, I focused entirely on writing a manual mutex and boolean flag to give us explicit control over the state. What I completely missed was that we didn't need to replace pthread_once, we just needed to reposition it. I will fix it. Just to confirm, I will apply this exact same NVML_SUCCESS wrapper to nvmlInit, nvmlInit_v2, and nvmlInitWithFlags. Does that sound good to you? |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: archlitchi, maishivamhoo123 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
I've been wondering would this be a bit faster with an atomic CAS solution? not sure if that kind of performance matters here. EDIT: never mind, I noticed you already removed the mutex requirement. |
|
New changes are detected. LGTM label has been removed. |
Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
b0a6610 to
1e9e821
Compare
|
Hi, looks like we got an issue during build, could you fix that? |
Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
Description
Fixes a critical edge case where GPU initialization permanently fails in Kubernetes 1.27+ Burstable pods (e.g., when device files are temporarily locked during
resize: InProgress).Issue Related : #167 and Project-HAMi/HAMi#1704 (comment)
The Bug:
Because
nvml_postInitusedpthread_once, it was marked as "completed" even if the initialnvmlInitdriver call failed. If an application retried the initialization later,nvml_postInitwas bypassed, leaving the virtual GPU map permanently broken.The Fix:
pthread_oncewith apthread_mutex_tand avolatile intflag.nvmlInit,nvmlInit_v2, andnvmlInitWithFlagsto only run and marknvml_postInit()as complete if the NVIDIA driver returnsNVML_SUCCESS.Changes Made
src/nvml/hook.c: Replacedpthread_oncewith mutex-guarded success checks for post-initialization.test/test_nvml_init.c: Added a unit test to verify the retry logic.Checklist
make build-in-docker(passes cleanly).