Skip to content

fix(nvml): allow retrying nvmlInit to prevent permanent failure state#168

Open
maishivamhoo123 wants to merge 6 commits into
Project-HAMi:mainfrom
maishivamhoo123:fix-nvmlinit-retry
Open

fix(nvml): allow retrying nvmlInit to prevent permanent failure state#168
maishivamhoo123 wants to merge 6 commits into
Project-HAMi:mainfrom
maishivamhoo123:fix-nvmlinit-retry

Conversation

@maishivamhoo123
Copy link
Copy Markdown
Member

@maishivamhoo123 maishivamhoo123 commented Mar 26, 2026

Description

Fixes a critical edge case where GPU initialization permanently fails in Kubernetes 1.27+ Burstable pods (e.g., when device files are temporarily locked during resize: InProgress).
Issue Related : #167 and Project-HAMi/HAMi#1704 (comment)

The Bug:
Because nvml_postInit used pthread_once, it was marked as "completed" even if the initial nvmlInit driver call failed. If an application retried the initialization later, nvml_postInit was bypassed, leaving the virtual GPU map permanently broken.

The Fix:

  • Replaced pthread_once with a pthread_mutex_t and a volatile int flag.
  • Updated nvmlInit, nvmlInit_v2, and nvmlInitWithFlags to only run and mark nvml_postInit() as complete if the NVIDIA driver returns NVML_SUCCESS.
  • Applications can now safely retry initialization without poisoning the state.

Changes Made

  • src/nvml/hook.c: Replaced pthread_once with mutex-guarded success checks for post-initialization.
  • test/test_nvml_init.c: Added a unit test to verify the retry logic.

Checklist

  • Code follows the project's coding guidelines.
  • Tested locally using make build-in-docker (passes cleanly).
  • Added unit tests to verify the retry behavior.

@hami-robot
Copy link
Copy Markdown
Contributor

hami-robot Bot commented Mar 26, 2026

Welcome @maishivamhoo123! It looks like this is your first PR to Project-HAMi/HAMi-core 🎉

@maishivamhoo123
Copy link
Copy Markdown
Member Author

maishivamhoo123 commented Mar 30, 2026

@DSFans2014 , @archlitchi @Shouren and @team can you please review these PR?

@archlitchi
Copy link
Copy Markdown
Member

i got the issue, but i wonder why replace pthread_once, we can simply check NVML_SUCCESS before calling that

@maishivamhoo123
Copy link
Copy Markdown
Member Author

i got the issue, but i wonder why replace pthread_once, we can simply check NVML_SUCCESS before calling that

Thank you @archlitchi for your review , yes you are right we should simply check NVML_SUCCESS before calling that , When I initially debugged this, I saw that pthread_once was rigidly locking our state on failure. Because pthread_once doesn't have a reset mechanism, my immediate thought was that the tool itself was too inflexible for a retry loop. Therefore, I focused entirely on writing a manual mutex and boolean flag to give us explicit control over the state. What I completely missed was that we didn't need to replace pthread_once, we just needed to reposition it. I will fix it.

Just to confirm, I will apply this exact same NVML_SUCCESS wrapper to nvmlInit, nvmlInit_v2, and nvmlInitWithFlags. Does that sound good to you?

Copy link
Copy Markdown
Member

@archlitchi archlitchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@hami-robot
Copy link
Copy Markdown
Contributor

hami-robot Bot commented Apr 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi, maishivamhoo123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot Bot added the approved label Apr 9, 2026
@fishman
Copy link
Copy Markdown

fishman commented Apr 10, 2026

I've been wondering would this be a bit faster with an atomic CAS solution? not sure if that kind of performance matters here.

EDIT: never mind, I noticed you already removed the mutex requirement.

@hami-robot hami-robot Bot removed the lgtm label Apr 14, 2026
@hami-robot
Copy link
Copy Markdown
Contributor

hami-robot Bot commented Apr 14, 2026

New changes are detected. LGTM label has been removed.

Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
@archlitchi
Copy link
Copy Markdown
Member

archlitchi commented Apr 17, 2026

Hi, looks like we got an issue during build, could you fix that?

Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants