Fuse per-sequence AlltoAll into a unified one in GDN forward #4913
xuantengh wants to merge 9 commits into
Conversation
/claude review
/ok to test 8dcfd7d
2c59663 to 6d28256
/ok to test 6d28256
/claude review
f5714af to efb43a2
/claude review
Light review: the core logic for fusing per-sequence AlltoAll into a single batched a2a + local permutations looks correct. The mamba_context_parallel.py cleanup is a clean replacement with existing utility functions. Test coverage in TestFusedThdAllToAll is solid — it validates both directions and a round-trip identity check across multiple cu_seqlens configurations and cp_sizes.
Only nit: unused import os and a commented-out os.environ.setdefault line in the test file (see inline comments).
efb43a2 to de01425
4404bc7 to be051c4
be051c4 to a0044a5
    return unpacked_x

def _build_thd_cp_a2a_perm(
    cu_seqlens: torch.Tensor, cp_size: int, t_global: int
) -> Tuple[torch.Tensor, torch.Tensor]:
Nit: let's move toward new-style type specification using built-in tuple instead of typing.Tuple.
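For reference, a minimal sketch of what the suggested annotation style looks like, assuming Python 3.9 or later (where built-in generics are subscriptable). `build_perm` is a hypothetical stand-in, not the function from this PR; it only illustrates a return annotation written with the built-in `tuple`.

```python
from typing import Tuple, get_args, get_origin

# Both spellings describe the same type; the built-in form needs no import.
OldStyle = Tuple[int, int]
NewStyle = tuple[int, int]


def build_perm(n: int) -> tuple[list[int], list[int]]:
    """Hypothetical helper: return a permutation of range(n) and its inverse."""
    perm = list(reversed(range(n)))
    inverse = [0] * n
    for dst, src in enumerate(perm):
        inverse[src] = dst
    return perm, inverse
```

At runtime the two aliases resolve to the same origin and arguments, so the change is purely stylistic.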
@lru_cache(maxsize=8)
def _build_head_perm_for_split_sections(
    split_sections: Tuple[int], cp_size: int, device: torch.device
Same nit as before, but also, the type should be tuple[int, ...].
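To illustrate the distinction: `tuple[int]` means a tuple of exactly one `int`, while `tuple[int, ...]` means a tuple of any length. The sketch below is an assumption-laden illustration, not the PR's code: `rank_major_perm` is a hypothetical function, and the rank-major layout it builds is only one plausible feature-dimension reordering. It also shows why the argument is a tuple at all: `lru_cache` requires hashable arguments.

```python
from functools import lru_cache
from typing import get_args


@lru_cache(maxsize=8)
def rank_major_perm(split_sections: tuple[int, ...], cp_size: int) -> tuple[int, ...]:
    """Hypothetical: index order that groups each rank's slice of every section.

    Assumes every section width is divisible by cp_size.
    """
    offsets = [0]
    for width in split_sections:
        offsets.append(offsets[-1] + width)
    perm = []
    for rank in range(cp_size):
        for start, width in zip(offsets, split_sections):
            per_rank = width // cp_size
            perm.extend(range(start + rank * per_rank, start + (rank + 1) * per_rank))
    return tuple(perm)
```

Passing a list instead of a tuple would raise `TypeError` because lists are unhashable, which is why callers must convert section sizes to a tuple first.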
What does this PR do?
PR #2645 added THD support for GDN, but it requires a separate AlltoAll per sequence and per section (qkv, alpha, beta, gate) to convert activations between CP <-> HP. This PR fuses the per-sequence AlltoAll calls into a single unified one, using local index mappings to reorder along the sequence dimension and the feature dimension.
Benchmark (wandb link): for a Qwen3 Next proxy model with a 16K sequence length and CP size = 4, the iteration-time speedup is around 7% to 10%.
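The idea of replacing several per-sequence exchanges with one exchange plus a local permutation can be sketched in a single process. This is a simulation under stated assumptions, not the PR's implementation: ranks are plain Python lists, each rank is assumed to hold one contiguous chunk of every sequence (the real chunking may differ), every sequence length and the head count are assumed divisible by `cp_size`, and only the sequence-dimension reordering is modeled. All function names are hypothetical.

```python
def all_to_all(send):
    """Simulated collective: recv[dst][src] = send[src][dst]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def local_tokens(cu_seqlens, cp_size, rank):
    """Global token ids on `rank` (assumed: one contiguous chunk per sequence)."""
    ids = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        chunk = (end - start) // cp_size
        ids.extend(range(start + rank * chunk, start + (rank + 1) * chunk))
    return ids


def per_sequence_a2a(cu_seqlens, cp_size, num_heads):
    """Baseline: one all-to-all per sequence, results concatenated."""
    hp = num_heads // cp_size
    out = [[] for _ in range(cp_size)]
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        chunk = (end - start) // cp_size
        send = [[[(t, h)
                  for t in range(start + src * chunk, start + (src + 1) * chunk)
                  for h in range(dst * hp, (dst + 1) * hp)]
                 for dst in range(cp_size)] for src in range(cp_size)]
        for dst, parts in enumerate(all_to_all(send)):
            for part in parts:
                out[dst].extend(part)
    return out


def build_seq_perm(cu_seqlens, cp_size):
    """Index into the source-major receive buffer for each global token position."""
    t_local = cu_seqlens[-1] // cp_size
    perm = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        chunk = (end - start) // cp_size
        local_off = start // cp_size  # offset of this sequence inside each shard
        for src in range(cp_size):
            base = src * t_local + local_off
            perm.extend(range(base, base + chunk))
    return perm


def fused_a2a(cu_seqlens, cp_size, num_heads):
    """Fused: a single all-to-all over all packed tokens, then a local gather."""
    hp = num_heads // cp_size
    send = [[[[(t, h) for h in range(dst * hp, (dst + 1) * hp)]
              for t in local_tokens(cu_seqlens, cp_size, src)]
             for dst in range(cp_size)] for src in range(cp_size)]
    perm = build_seq_perm(cu_seqlens, cp_size)
    out = []
    for parts in all_to_all(send):
        buf = [tok for part in parts for tok in part]  # source-major token order
        out.append([pair for i in perm for pair in buf[i]])
    return out
```

With `cu_seqlens = [0, 4, 12]` and `cp_size = 2`, both paths return the same per-rank layout; the fused path issues one exchange instead of one per sequence, at the cost of an index gather whose permutation depends only on `cu_seqlens` and `cp_size`.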
Contribution process
Pre-checks
Code review
Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.
Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.
For PRs outside megatron/core, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the Approved label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.
MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.