Skip to content

Disable TE cross entropy loss fusion#5115

Merged
ko3n1g merged 2 commits into
NVIDIA:mainfrom
mchrzanowski:disable-te-cross-entropy-fusion
Jun 3, 2026
Merged

Disable TE cross entropy loss fusion#5115
ko3n1g merged 2 commits into
NVIDIA:mainfrom
mchrzanowski:disable-te-cross-entropy-fusion

Conversation

@mchrzanowski
Copy link
Copy Markdown
Contributor

@mchrzanowski mchrzanowski commented Jun 2, 2026

[X] I, the PR author, have personally reviewed every line of this PR.

What does this PR do ?

Disables the Transformer Engine implementation of cross entropy loss fusion with an assertion due to observed training stability issues, while keeping native cross entropy fusion available.

Issue tracking

Linked issue: N/A, small stability bug fix.

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Validation

  • Added a unit test covering rejection of cross_entropy_loss_fusion=True with cross_entropy_fusion_impl='te'.
  • Updated existing functional test configs and MoE README guidance to use native.
  • Ran python -m py_compile on changed Python files.
  • Ran git diff --check.
  • Could not run pytest in the local environment because torch is not installed.

@mchrzanowski mchrzanowski requested review from a team as code owners June 2, 2026 19:46
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft June 2, 2026 19:47
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@mchrzanowski mchrzanowski marked this pull request as ready for review June 2, 2026 19:48
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Jun 2, 2026

Greptile Summary

This PR disables the Transformer Engine implementation of cross-entropy loss fusion by adding a guard that raises an error when cross_entropy_loss_fusion=True is combined with cross_entropy_fusion_impl='te', while leaving the native fusion path fully available.

  • Adds the guard in both ModelParallelConfig.__post_init__ and validate_args, so the restriction applies whether configuration is supplied via the Python API or the CLI.
  • Updates six functional test YAML configs, one unit test fixture, and the MoE README to switch from te to native.
  • Adds a new tests/unit_tests/test_model_parallel_config.py covering both the rejected and the allowed paths.

Confidence Score: 4/5

The change correctly blocks the unstable TE fusion path; the only concern is that model_parallel_config.py uses assert while every other guard in that same file uses raise ValueError, leaving the check silently skippable under python -O.

In model_parallel_config.py the new guard uses assert while all six surrounding validation checks raise ValueError. Because Python's -O flag strips assert statements, the intended stability block would be bypassed silently in any optimised deployment that passes cross_entropy_fusion_impl='te' through the direct Python API rather than the CLI path. The fix is a one-line change and the rest of the PR is clean.

megatron/core/model_parallel_config.py and tests/unit_tests/test_model_parallel_config.py (the test expects AssertionError and will need updating alongside the fix)

Important Files Changed

Filename Overview
megatron/core/model_parallel_config.py Adds a guard against TE cross-entropy fusion using assert instead of raise ValueError, inconsistent with all other validation in the file and bypassable with python -O.
megatron/training/arguments.py Adds TE cross-entropy fusion guard in validate_args using assert, consistent with the 188 other assert-based validations in this file.
tests/unit_tests/test_model_parallel_config.py New unit test file; tests both the disabled TE path (expects AssertionError) and the allowed native path — will need updating if the guard is changed to raise ValueError.
tests/unit_tests/models/mimo/test_mimo_1f1b_schedule.py Updates cross_entropy_fusion_impl from 'te' to 'native' in test fixture to comply with the new guard.
megatron/core/transformer/moe/README.md Updates recommended flag from te to native in performance flags documentation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User sets cross_entropy_loss_fusion=True] --> B{cross_entropy_fusion_impl?}
    B -- "'te'" --> C["ModelParallelConfig.__post_init__\nassert / raise ValueError"]
    C --> D[AssertionError / ValueError\n'TE fusion disabled due to stability issues']
    B -- "'native'" --> E[Validation passes]
    E --> F[Native cross-entropy fusion used]

    G[CLI: --cross-entropy-loss-fusion] --> H{--cross-entropy-fusion-impl?}
    H -- "te" --> I["validate_args assert"]
    I --> J[AssertionError\n'Use --cross-entropy-fusion-impl native']
    H -- "native" --> K[Validation passes]
    K --> L[Native cross-entropy fusion used]
Loading

Reviews (1): Last reviewed commit: "Disable TE cross entropy loss fusion" | Re-trigger Greptile

Comment thread megatron/core/model_parallel_config.py Outdated
Comment on lines +437 to +442
assert not (
self.cross_entropy_loss_fusion and self.cross_entropy_fusion_impl == 'te'
), (
"Transformer Engine cross entropy loss fusion is disabled due to stability issues. "
"Use cross_entropy_fusion_impl='native', or disable cross_entropy_loss_fusion."
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Every other validation in __post_init__ uses raise ValueError(...), but this guard uses assert. Running Python with the -O (optimize) flag silently disables all assert statements, meaning the TE fusion combination would be permitted and the stability regression would reappear without any error. Use raise ValueError here to match the surrounding code and make the guard unconditional.

Suggested change
assert not (
self.cross_entropy_loss_fusion and self.cross_entropy_fusion_impl == 'te'
), (
"Transformer Engine cross entropy loss fusion is disabled due to stability issues. "
"Use cross_entropy_fusion_impl='native', or disable cross_entropy_loss_fusion."
)
if self.cross_entropy_loss_fusion and self.cross_entropy_fusion_impl == 'te':
raise ValueError(
"Transformer Engine cross entropy loss fusion is disabled due to stability issues. "
"Use cross_entropy_fusion_impl='native', or disable cross_entropy_loss_fusion."
)

Comment on lines +8 to +10
def test_te_cross_entropy_loss_fusion_is_disabled():
with pytest.raises(AssertionError, match="Transformer Engine cross entropy loss fusion"):
ModelParallelConfig(cross_entropy_loss_fusion=True, cross_entropy_fusion_impl='te')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 If the guard in model_parallel_config.py is changed from assert to raise ValueError (as suggested), this test will need to match ValueError instead of AssertionError.

Suggested change
def test_te_cross_entropy_loss_fusion_is_disabled():
with pytest.raises(AssertionError, match="Transformer Engine cross entropy loss fusion"):
ModelParallelConfig(cross_entropy_loss_fusion=True, cross_entropy_fusion_impl='te')
def test_te_cross_entropy_loss_fusion_is_disabled():
with pytest.raises(ValueError, match="Transformer Engine cross entropy loss fusion"):
ModelParallelConfig(cross_entropy_loss_fusion=True, cross_entropy_fusion_impl='te')

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@ko3n1g ko3n1g added the core_r0.18.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. label Jun 2, 2026
@mchrzanowski mchrzanowski force-pushed the disable-te-cross-entropy-fusion branch from 2fac253 to 89daa0f Compare June 3, 2026 02:12
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Final Review PR is in the "final review" stage label Jun 3, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added Approved All necessary approvals have been made and removed Final Review PR is in the "final review" stage labels Jun 3, 2026
@ko3n1g
Copy link
Copy Markdown
Contributor

ko3n1g commented Jun 3, 2026

/ok to test eab65fb

@svcnvidia-nemo-ci
Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26906455119

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 3, 2026
@ko3n1g ko3n1g added this pull request to the merge queue Jun 3, 2026
@svcnvidia-nemo-ci
Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26914187574

Merged via the queue into NVIDIA:main with commit 168cb15 Jun 3, 2026
86 of 87 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Approved All necessary approvals have been made complexity: low core_r0.18.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants