Fix SM utilization reporting for forked CUDA worker processes by lyquid617 · Pull Request #199 · Project-HAMi/HAMi-core

lyquid617 · 2026-05-29T02:25:20Z

Problem

For Python multiprocessing workloads using HAMi-core, Device_utilization_desc_of_container may stay at 0 even when GPU processes are actively running.

In the affected workload:

nvidia-smi pmon inside the container reports ~97% SM utilization per worker process.
nvmlDeviceGetProcessUtilization inside the container returns correct per-process samples.
vGPU_device_memory_usage_in_bytes is updated correctly.
Device_utilization_desc_of_container remains 0.
The shared cache has valid memory usage and updated last_kernel_time, but device_util[].sm_util stays 0.

Root Cause

postInit() starts the utilization watcher via init_utilization_watcher(), but it is guarded by pthread_once(post_cuinit_flag) and is only reliably triggered through the cuInit() wrapper.

For forked Python multiprocessing workers, the child process may inherit a completed post_cuinit_flag from the parent. Later CUDA kernel launches in the worker call the hooked launch path, but postInit() is skipped, so host PID detection and utilization watcher initialization never happen in the worker process.
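The inherited-flag behavior described above can be reproduced without CUDA. The following is a minimal sketch, not HAMi-core code: once_demo_flag stands in for post_cuinit_flag, and once_demo_init, once_demo_init_pid, and once_demo_child_skips_init are hypothetical names introduced for illustration. It assumes a POSIX system where fork() copies the completed pthread_once_t state into the child, as glibc does.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for post_cuinit_flag. */
static pthread_once_t once_demo_flag = PTHREAD_ONCE_INIT;
/* PID of the process that actually ran the init routine. */
static pid_t once_demo_init_pid = 0;

/* Stand-in for postInit(): records which process initialized. */
static void once_demo_init(void) { once_demo_init_pid = getpid(); }

/* Returns 1 if the forked child skipped per-process init,
 * 0 if the child re-ran it, and -1 on error. */
int once_demo_child_skips_init(void) {
    /* The parent completes the once-guarded init before forking. */
    pthread_once(&once_demo_flag, once_demo_init);

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* The child's copy of the flag is already "done", so this is a no-op
         * and once_demo_init_pid still holds the parent's PID. */
        pthread_once(&once_demo_flag, once_demo_init);
        _exit(once_demo_init_pid == getpid() ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling once_demo_child_skips_init() from a test program should report that the child skipped initialization, which mirrors a worker that never starts its own utilization watcher.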

There is also an existing multi-GPU bug in init_gpu_device_utilization(): the inner loop breaks after device 0, so only the first device is reset. See #148.
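To illustrate the effect of a stray break in the inner loop, here is a small sketch. The struct layout, the util_demo_* names, and the array sizes are assumptions made for illustration; only the device_util[].sm_util field name comes from the description above, and the real init_gpu_device_utilization() may be structured differently.

```c
#define UTIL_DEMO_MAX_PROCS 4
#define UTIL_DEMO_MAX_DEVICES 8

/* Hypothetical, simplified shape of the per-process utilization slots. */
typedef struct { unsigned int sm_util; } util_demo_device_t;
typedef struct { util_demo_device_t device_util[UTIL_DEMO_MAX_DEVICES]; } util_demo_proc_t;

/* Sets every sm_util entry to the given value (used to simulate stale data). */
void util_demo_fill(util_demo_proc_t *procs, int nprocs, int ndev, unsigned int value) {
    for (int i = 0; i < nprocs; i++)
        for (int dev = 0; dev < ndev; dev++)
            procs[i].device_util[dev].sm_util = value;
}

/* Buggy reset: the break leaves the inner loop after device 0. */
void util_demo_reset_buggy(util_demo_proc_t *procs, int nprocs, int ndev) {
    for (int i = 0; i < nprocs; i++) {
        for (int dev = 0; dev < ndev; dev++) {
            procs[i].device_util[dev].sm_util = 0;
            break; /* erroneous: devices 1..ndev-1 are never reset */
        }
    }
}

/* Fixed reset: every device of every process slot is cleared. */
void util_demo_reset_fixed(util_demo_proc_t *procs, int nprocs, int ndev) {
    for (int i = 0; i < nprocs; i++)
        for (int dev = 0; dev < ndev; dev++)
            procs[i].device_util[dev].sm_util = 0;
}

/* Counts entries that still hold a non-zero value. */
int util_demo_count_nonzero(const util_demo_proc_t *procs, int nprocs, int ndev) {
    int count = 0;
    for (int i = 0; i < nprocs; i++)
        for (int dev = 0; dev < ndev; dev++)
            if (procs[i].device_util[dev].sm_util != 0) count++;
    return count;
}
```

With two process slots and eight devices, the buggy variant leaves seven stale entries per slot, while the fixed variant clears all of them.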

Fix

Add ensure_post_init() and call it from kernel launch wrappers.
Reset post_cuinit_flag and pidfound in the child process after fork.
Remove the erroneous break in init_gpu_device_utilization().
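The first two items can be sketched together as follows. This is a hedged illustration rather than the code in this PR: all fix_demo_* names are hypothetical, the use of pthread_atfork() to perform the child-side reset is an assumption (the description does not say which mechanism is used), and resetting a pthread_once_t by copying a PTHREAD_ONCE_INIT value is a common practice that POSIX does not formally guarantee.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for post_cuinit_flag and pidfound. */
static pthread_once_t fix_demo_post_flag = PTHREAD_ONCE_INIT;
static pthread_once_t fix_demo_atfork_flag = PTHREAD_ONCE_INIT;
static const pthread_once_t fix_demo_once_reset = PTHREAD_ONCE_INIT;
static int fix_demo_pidfound = 0;
static pid_t fix_demo_watcher_pid = 0; /* process that "started the watcher" */

/* Stand-in for postInit(): host PID detection plus watcher start. */
static void fix_demo_post_init(void) {
    fix_demo_pidfound = 1;
    fix_demo_watcher_pid = getpid();
}

/* Runs in the child right after fork(): forget the parent's init state. */
static void fix_demo_child_after_fork(void) {
    fix_demo_post_flag = fix_demo_once_reset;
    fix_demo_pidfound = 0;
}

static void fix_demo_register_atfork(void) {
    pthread_atfork(NULL, NULL, fix_demo_child_after_fork);
}

/* Sketch of ensure_post_init(): cheap to call on every launch. */
void fix_demo_ensure_post_init(void) {
    pthread_once(&fix_demo_atfork_flag, fix_demo_register_atfork);
    pthread_once(&fix_demo_post_flag, fix_demo_post_init);
}

/* Stand-in for a hooked kernel launch wrapper.
 * Returns 1 if this process has its own initialized state. */
int fix_demo_launch_kernel(void) {
    fix_demo_ensure_post_init();
    return fix_demo_pidfound && fix_demo_watcher_pid == getpid();
}

/* Returns 1 if both parent and forked child end up initialized, -1 on error. */
int fix_demo_child_reinitializes(void) {
    if (!fix_demo_launch_kernel()) return -1;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) _exit(fix_demo_launch_kernel() ? 0 : 1);
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Routing the check through the launch wrapper means a worker that never calls cuInit() itself still runs its own post-init on the first kernel launch, and the once guard keeps the per-launch cost to a single flag check.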

Validation

Tested with an 8-GPU Python multiprocessing workload:

Before:

nvidia-smi pmon: ~97% SM utilization
Device_utilization_desc_of_container: 0

After:

Device_utilization_desc_of_container: 97-98 per GPU
vGPU memory metrics remain correct

hami-robot · 2026-05-29T02:25:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lyquid617
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hami-robot · 2026-05-29T02:25:31Z

Welcome @lyquid617! It looks like this is your first PR to Project-HAMi/HAMi-core 🎉

archlitchi · 2026-06-04T07:51:30Z

thanks for the fix, please sign-off your commit

Signed-off-by: lyquid <l000064@lassoquant.com>

lyquid617 · 2026-06-04T09:43:09Z

> thanks for the fix, please sign-off your commit

fixed

hami-robot Bot added the dco-signoff: no label May 29, 2026

hami-robot Bot requested review from archlitchi and chaunceyjiang May 29, 2026 02:25

hami-robot Bot added the size/S label May 29, 2026

lyquid617 force-pushed the fix-util-init branch from bbe5740 to 201d477 Compare June 4, 2026 09:10

hami-robot Bot added dco-signoff: yes and removed dco-signoff: no labels Jun 4, 2026

fix cuda init

77f4c28

Signed-off-by: lyquid <l000064@lassoquant.com>

lyquid617 force-pushed the fix-util-init branch from 201d477 to 4c5310c Compare June 4, 2026 09:19

fix cuda init for subprocess

03e8938

Signed-off-by: lyquid <l000064@lassoquant.com>

lyquid617 force-pushed the fix-util-init branch from 4c5310c to 03e8938 Compare June 4, 2026 09:39

lyquid617 wants to merge 2 commits into Project-HAMi:main from lyquid617:fix-util-init
