fix(nvml): allow retrying nvmlInit to prevent permanent failure state by maishivamhoo123 · Pull Request #168 · Project-HAMi/HAMi-core

maishivamhoo123 · 2026-03-26T17:45:16Z

Description

Fixes a critical edge case where GPU initialization permanently fails in Kubernetes 1.27+ Burstable pods (e.g., when device files are temporarily locked during resize: InProgress).
Related issues: #167 and Project-HAMi/HAMi#1704 (comment)

The Bug:
Because nvml_postInit used pthread_once, it was marked as "completed" even if the initial nvmlInit driver call failed. If an application retried the initialization later, nvml_postInit was bypassed, leaving the virtual GPU map permanently broken.
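The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual hook.c code: `fake_driver_init`, `post_init`, `buggy_init`, and the flags are hypothetical stand-ins for the driver call, nvml_postInit, and the hooked entry point, and the fake driver is assumed to fail exactly once.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not HAMi-core symbols. */
static int  driver_calls = 0;
static bool driver_ready = false;
static bool map_built    = false;

/* Simulated driver init: fails on the first call, succeeds afterwards. */
static int fake_driver_init(void) {
    driver_calls++;
    driver_ready = (driver_calls > 1);
    return driver_ready ? 0 : -1;
}

/* Simulated post-init: can only build the device map if the driver is up. */
static void post_init(void) {
    if (driver_ready) {
        map_built = true;
    }
}

static pthread_once_t post_once = PTHREAD_ONCE_INIT;

/* Buggy pattern: the once-control is consumed whether or not init succeeded. */
static int buggy_init(void) {
    int rc = fake_driver_init();
    pthread_once(&post_once, post_init);  /* runs on the failed attempt too */
    return rc;
}
```

Calling `buggy_init()` twice returns a failure and then a success, yet `map_built` stays false, because the second `pthread_once` call is a no-op.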

The Fix:

Replaced pthread_once with a pthread_mutex_t and a volatile int flag.
Updated nvmlInit, nvmlInit_v2, and nvmlInitWithFlags to only run and mark nvml_postInit() as complete if the NVIDIA driver returns NVML_SUCCESS.
Applications can now safely retry initialization without poisoning the state.
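The mutex-and-flag approach listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR diff: `fake_driver_init`, `post_init`, `retryable_init`, and the counters are hypothetical names, and success is assumed to be a return value of 0.

```c
#include <pthread.h>

/* Hypothetical stand-ins, not HAMi-core symbols. */
static pthread_mutex_t post_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int post_done = 0;  /* set only after a successful init */
static int post_runs = 0;           /* counts post_init executions */
static int attempts  = 0;

/* Simulated driver init: fails on the first call, succeeds afterwards. */
static int fake_driver_init(void) {
    return (++attempts > 1) ? 0 : -1;
}

static void post_init(void) {
    post_runs++;
}

/* Post-init runs at most once, and only after the driver reports success. */
static int retryable_init(void) {
    int rc = fake_driver_init();
    if (rc != 0) {
        return rc;                  /* leave state untouched so a retry can work */
    }
    pthread_mutex_lock(&post_lock);
    if (!post_done) {
        post_init();
        post_done = 1;
    }
    pthread_mutex_unlock(&post_lock);
    return rc;
}
```

The flag is read and written only while the mutex is held, so the lock, not `volatile`, is what provides the synchronization in this sketch.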

Changes Made

src/nvml/hook.c: Replaced pthread_once with mutex-guarded success checks for post-initialization.
test/test_nvml_init.c: Added a unit test to verify the retry logic.

Checklist

Code follows the project's coding guidelines.
Tested locally using make build-in-docker (passes cleanly).
Added unit tests to verify the retry behavior.

hami-robot · 2026-03-26T17:45:26Z

Welcome @maishivamhoo123! It looks like this is your first PR to Project-HAMi/HAMi-core 🎉

maishivamhoo123 · 2026-03-30T17:47:14Z

@DSFans2014, @archlitchi, @Shouren, and @team, can you please review this PR?

archlitchi · 2026-04-08T13:28:38Z

i got the issue, but i wonder why replace pthread_once, we can simply check NVML_SUCCESS before calling that

maishivamhoo123 · 2026-04-08T14:18:45Z

> i got the issue, but i wonder why replace pthread_once, we can simply check NVML_SUCCESS before calling that

Thank you @archlitchi for your review. Yes, you are right: we should simply check NVML_SUCCESS before calling that. When I initially debugged this, I saw that pthread_once was rigidly locking our state on failure. Because pthread_once doesn't have a reset mechanism, my immediate thought was that the tool itself was too inflexible for a retry loop, so I focused entirely on writing a manual mutex and boolean flag to give us explicit control over the state. What I completely missed was that we didn't need to replace pthread_once; we just needed to reposition it. I will fix it.

Just to confirm, I will apply this exact same NVML_SUCCESS wrapper to nvmlInit, nvmlInit_v2, and nvmlInitWithFlags. Does that sound good to you?
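The repositioned pthread_once discussed in this comment might look roughly like the sketch below. All names here (`fake_driver_init`, `post_init`, `run_post_init_on_success`, `hooked_init`, `hooked_init_with_flags`) are hypothetical and only illustrate sharing one success check across several entry points; success is assumed to be a return value of 0.

```c
#include <pthread.h>

/* Hypothetical stand-ins, not HAMi-core symbols. */
static pthread_once_t post_once = PTHREAD_ONCE_INIT;
static int post_runs = 0;
static int attempts  = 0;

/* Simulated driver init: fails on the first call, succeeds afterwards. */
static int fake_driver_init(void) {
    return (++attempts > 1) ? 0 : -1;
}

static void post_init(void) {
    post_runs++;
}

/* Shared wrapper: consume the once-control only after a successful init. */
static int run_post_init_on_success(int rc) {
    if (rc == 0) {
        pthread_once(&post_once, post_init);
    }
    return rc;
}

/* Each entry point forwards its driver result through the same wrapper. */
static int hooked_init(void) {
    return run_post_init_on_success(fake_driver_init());
}

static int hooked_init_with_flags(unsigned int flags) {
    (void)flags;  /* the simulated driver ignores flags */
    return run_post_init_on_success(fake_driver_init());
}
```

A failed first call leaves `post_once` untouched, so a later successful call through either entry point still triggers `post_init` exactly once.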

archlitchi

/lgtm

hami-robot · 2026-04-09T12:43:51Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi, maishivamhoo123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [archlitchi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fishman · 2026-04-10T03:56:21Z

I've been wondering: would this be a bit faster with an atomic CAS solution? Not sure if that kind of performance matters here.

EDIT: never mind, I noticed you already removed the mutex requirement.

hami-robot · 2026-04-14T01:01:42Z

New changes are detected. LGTM label has been removed.


archlitchi · 2026-04-17T02:19:45Z

Hi, looks like we got an issue during build, could you fix that?


hami-robot Bot added the dco-signoff: yes label Mar 26, 2026

hami-robot Bot requested review from archlitchi and chaunceyjiang March 26, 2026 17:45

hami-robot Bot added the size/M label Mar 26, 2026

maishivamhoo123 mentioned this pull request Mar 27, 2026

nvidia-smi error in pod when cpu-memory limits!=requests Project-HAMi/HAMi#1704

Open

archlitchi approved these changes Apr 9, 2026


hami-robot Bot assigned archlitchi Apr 9, 2026

hami-robot Bot added the lgtm label Apr 9, 2026

hami-robot Bot added the approved label Apr 9, 2026

hami-robot Bot removed the lgtm label Apr 14, 2026

hami-robot Bot added dco-signoff: no size/L and removed dco-signoff: yes size/M labels Apr 14, 2026

maishivamhoo123 added 5 commits April 14, 2026 01:04

fix(nvml): allow retrying nvmlInit on failure

4cf245b

Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>

fix(nvml): ensure nvml_postInit only runs on NVML_SUCCESS

615d0e3

Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>

logs meggase added

aed5449

Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>

deleted launch.json

12ef9f7

Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>

style: remove trailing whitespaces in hook.c

1e9e821

Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>

maishivamhoo123 force-pushed the fix-nvmlinit-retry branch from b0a6610 to 1e9e821 Compare April 14, 2026 01:04

hami-robot Bot added dco-signoff: yes and removed dco-signoff: no labels Apr 14, 2026

the build command is passing in the local

ef2c79c

Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>

maishivamhoo123 requested a review from archlitchi April 18, 2026 18:08

maishivamhoo123 wants to merge 6 commits into Project-HAMi:main from maishivamhoo123:fix-nvmlinit-retry

3 participants
