
Optimize balancer and setup debug logger. #308

Open
JacoCheung wants to merge 10 commits into NVIDIA:main from JacoCheung:junzhang/opt_balancer

Conversation

@JacoCheung
Collaborator

@JacoCheung JacoCheung commented Feb 11, 2026

Description

This PR aims to fully hide the balancer overhead (the Karmarkar-Karp algorithm running on the host) and to optimize the communication of the batch allgather.

In addition, a set of helper utilities is added:

  1. HSTU kernel SoL benchmark (dense shapes).
    - The peak SoL is about 60% for forward and 50% for backward.
  2. HSTU kernel balancer optimization benchmark (the HSTU speedup when the input is shuffled evenly).
  3. In-flight HSTU MFU logging (a minimal sketch follows this list).
  4. Balancer logger.
  5. Dataset seqlen distribution specs.
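
As a rough illustration of item 3, here is a minimal sketch of what an in-flight MFU log line computes. The names (log_mfu, GPU_PEAK_TFLOPS, attn_flops_per_step) and the peak-throughput constant are assumptions for illustration only, not the actual attn_perf_tracker.py API.

    # Hypothetical sketch of in-flight MFU logging; names and the peak value are
    # assumptions, not the attn_perf_tracker.py implementation.
    import time

    GPU_PEAK_TFLOPS = 989.0  # assumed dense BF16 peak of the target GPU

    def log_mfu(attn_flops_per_step: float, step_time_s: float) -> float:
        """Report MFU = achieved TFLOP/s divided by the GPU's peak TFLOP/s."""
        achieved_tflops = attn_flops_per_step / step_time_s / 1e12
        mfu = achieved_tflops / GPU_PEAK_TFLOPS
        print(f"HSTU attn MFU: {mfu:.1%} ({achieved_tflops:.1f} TFLOP/s)")
        return mfu

    start = time.time()
    # ... run one HSTU forward/backward step ...
    log_mfu(attn_flops_per_step=2.5e14, step_time_s=time.time() - start)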

@greptile-apps

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

This PR hides the balancer overhead and optimizes batch allgather communication in the training pipeline through two key improvements:

1. Background Thread for Balancer: Offloads the H2D transfer and batch shuffling (including Karmarkar-Karp load balancing) to a background thread using a ThreadPoolExecutor. The main thread continues with the forward pass while the background thread handles data movement on a separate CUDA stream (_memcpy_stream). A critical synchronization point is added before the main thread touches the DP communicator to prevent an NCCL deadlock (see the first sketch after this list).

2. Fused KJT AllGather: Replaces per-KJT gathering with keyed_jagged_tensor_list_allgather, which concatenates all KJT lengths/values and performs only 2 NCCL calls in total (1 for lengths, 1 for values), regardless of the number of KJTs. It uses keyed_jagged_index_select_dim1 for an efficient rank-major to key-major layout transpose (see the second sketch after this list).
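
The following is a minimal sketch of the background-thread overlap described in point 1, assuming the batch object exposes a to()-style H2D method; the class and method names (ShufflePrefetcher, _h2d_and_shuffle, shuffle_batch) are illustrative and not the actual train_pipeline.py code.

    import concurrent.futures
    import torch

    class ShufflePrefetcher:
        """Illustrative helper: overlap H2D + shuffle with the forward pass."""

        def __init__(self, shuffle_batch):
            self._shuffle_batch = shuffle_batch          # host KK + allgathers
            self._memcpy_stream = torch.cuda.Stream()    # side stream for H2D
            self._executor = concurrent.futures.ThreadPoolExecutor(
                max_workers=1, thread_name_prefix="shuffle"
            )

        def _h2d_and_shuffle(self, cpu_batch):
            # Runs on the background thread: copy on the side stream, then
            # balance/shuffle (Karmarkar-Karp on host + collectives).
            with torch.cuda.stream(self._memcpy_stream):
                gpu_batch = cpu_batch.to("cuda", non_blocking=True)
                return self._shuffle_batch(gpu_batch)

        def submit(self, cpu_batch):
            return self._executor.submit(self._h2d_and_shuffle, cpu_batch)

    # Main-loop pattern: submit batch i+1, run the forward pass for batch i on
    # the default stream, then call future.result() *before* any DP-group
    # collective so that NCCL call order stays identical on every rank.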
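And a minimal sketch of the fused-gather idea in point 2, shown with plain tensors and equal per-rank sizes for simplicity; the real keyed_jagged_tensor_list_allgather operates on KJTs, handles variable sizes, and transposes the rank-major result with keyed_jagged_index_select_dim1.

    import torch
    import torch.distributed as dist

    def fused_jagged_allgather(lengths_list, values_list, group=None):
        """Illustrative only: 2 collectives regardless of how many KJTs there are."""
        world_size = dist.get_world_size(group)

        # One NCCL call for every KJT's lengths, concatenated.
        local_lengths = torch.cat(lengths_list)
        all_lengths = [torch.empty_like(local_lengths) for _ in range(world_size)]
        dist.all_gather(all_lengths, local_lengths, group=group)

        # One NCCL call for every KJT's values, concatenated (equal sizes assumed
        # here; the real op exchanges sizes and gathers variable-length values).
        local_values = torch.cat(values_list)
        all_values = [torch.empty_like(local_values) for _ in range(world_size)]
        dist.all_gather(all_values, local_values, group=group)

        # The gathered layout is rank-major; a key-major transpose (e.g. via
        # keyed_jagged_index_select_dim1) follows in the actual implementation.
        return all_lengths, all_values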

Additional improvements:

  • Removed unnecessary barrier() in _gatherv_along_first_dim (NCCL already synchronizes streams)
  • Added debug logging for load balance statistics (controlled by the PRINT_LOAD_BALANCE env var; see the logging sketch after this list)
  • Added HSTU attention performance tracking with MFU calculation
  • Refactored shuffle into two phases (compute_partition_indices and shuffle_batch_by_global_indices) for better modularity
  • Added utility classes for random distributions in benchmark data generation
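
A minimal sketch of the env-var-gated, rank-aware logging mentioned above follows; the helper bodies are assumptions and may differ from the actual logger.py implementation (only the env var names PRINT_LOAD_BALANCE and LOG_LEVEL come from this summary).

    import logging
    import os

    import torch.distributed as dist

    logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO").upper())
    logger = logging.getLogger("balancer")

    PRINT_LOAD_BALANCE = os.environ.get("PRINT_LOAD_BALANCE", "0") == "1"

    def debug_rank_0(message: str) -> None:
        """Log at DEBUG level only on rank 0 when distributed is initialized."""
        if not dist.is_initialized() or dist.get_rank() == 0:
            logger.debug(message)

    def log_load_balance(workloads) -> None:
        # Gated by PRINT_LOAD_BALANCE so the stats cost nothing in normal runs.
        if PRINT_LOAD_BALANCE:
            mean = sum(workloads) / len(workloads)
            debug_rank_0(f"per-rank workloads: {workloads}, max/mean = {max(workloads) / mean:.2f}")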

Confidence Score: 4/5

  • Safe to merge with careful testing of concurrency and NCCL ordering
  • The PR implements sophisticated concurrency optimizations with proper NCCL synchronization points. The background threading and fused communication changes are well-documented and follow correct ordering constraints. However, the complexity of concurrent NCCL calls and stream synchronization requires thorough multi-GPU testing to ensure no deadlocks or race conditions occur in production workloads.
  • Pay close attention to examples/commons/pipeline/train_pipeline.py (concurrent NCCL usage) and examples/commons/ops/collective_ops.py (fused KJT allgather logic)

Important Files Changed

Filename: Overview
examples/commons/distributed/batch_shuffler.py: Refactored shuffle into two phases (compute_partition_indices and shuffle_batch_by_global_indices) for better separation; added load balance logging with env var controls.
examples/commons/distributed/batch_allgather.py: Optimized to fuse all KJT fields into a single AllGather call pair using keyed_jagged_tensor_list_allgather instead of per-KJT gathering.
examples/commons/ops/collective_ops.py: Added keyed_jagged_tensor_list_allgather for fused KJT gathering (2 NCCL calls total); removed an unnecessary barrier in _gatherv_along_first_dim.
examples/commons/pipeline/train_pipeline.py: Added a ThreadPoolExecutor to offload H2D + batch shuffle to a background thread; waits for shuffle completion before the main thread touches the DP communicator to avoid an NCCL deadlock.
examples/commons/utils/logger.py: Added debug_rank_0, print_rank_all, info_rank_all, and debug_rank_all functions; added LOG_LEVEL env var support; enhanced print_rank_0 with a log level parameter.
examples/commons/utils/attn_perf_tracker.py: New file for HSTU attention performance tracking with lazy printing, auto-detection of training vs. inference mode, and MFU calculation.

Sequence Diagram

sequenceDiagram
    participant Main as Main Thread
    participant BG as Background Thread
    participant GPU as GPU Stream
    participant NCCL as NCCL (DP Group)
    
    Note over Main,NCCL: Batch i+2 Processing (Optimized)
    Main->>Main: _next_batch(dataloader)
    Main->>BG: submit(_h2d_and_shuffle, batch)
    activate BG
    BG->>GPU: _to_device (H2D on memcpy_stream)
    BG->>NCCL: AllGather workloads
    BG->>BG: Karmarkar-Karp partitioning (host)
    BG->>NCCL: AllGather KJTs (fused, 2 calls)
    BG->>NCCL: AllGather dense tensors
    deactivate BG
    
    Note over Main: Batch i Processing (overlapped)
    Main->>GPU: Forward pass (default stream)
    Main->>BG: future.result() [WAIT HERE]
    activate BG
    BG-->>Main: shuffled_batch
    deactivate BG
    
    Note over Main,NCCL: Safe to use DP communicator
    Main->>NCCL: AllReduce loss (DP group)
    Main->>GPU: Backward pass
    Main->>NCCL: AllReduce gradients (DP group)


@greptile-apps greptile-apps bot left a comment


21 files reviewed, 1 comment


@JacoCheung JacoCheung force-pushed the junzhang/opt_balancer branch from 8b68ea4 to 2e3a262 on February 12, 2026 at 12:02

@greptile-apps greptile-apps bot left a comment


24 files reviewed, 3 comments


Comment on lines 64 to 65
def print_rank_all(message, level=logging.INFO):
"""If distributed is initialized, print only on rank 0."""

Docstring is incorrect - says "print only on rank 0" but print_rank_all actually prints on all ranks.

Suggested change
def print_rank_all(message, level=logging.INFO):
"""If distributed is initialized, print only on rank 0."""
"""If distributed is initialized, print on all ranks."""

Comment on lines 72 to 73
def info_rank_all(message):
"""If distributed is initialized, print only on rank 0."""

Docstring is incorrect - says "print only on rank 0" but info_rank_all actually prints on all ranks.

Suggested change
def info_rank_all(message):
"""If distributed is initialized, print only on rank 0."""
"""If distributed is initialized, print on all ranks."""

Comment on lines 80 to 81
def debug_rank_all(message):
"""If distributed is initialized, print only on rank 0."""

Docstring is incorrect - says "print only on rank 0" but debug_rank_all actually prints on all ranks.

Suggested change
def debug_rank_all(message):
"""If distributed is initialized, print only on rank 0."""
"""If distributed is initialized, print on all ranks."""


@greptile-apps greptile-apps bot left a comment


24 files reviewed, 1 comment


Comment on lines +719 to +722
self._shuffle_executor = concurrent.futures.ThreadPoolExecutor(
max_workers=1, thread_name_prefix="shuffle"
)
self._shuffle_future: Optional["concurrent.futures.Future[Optional[In]]"] = None

Consider adding cleanup for _shuffle_executor to ensure graceful thread termination:

    def __del__(self):
        if hasattr(self, '_shuffle_executor'):
            self._shuffle_executor.shutdown(wait=True)
