feat: reduce nb experts per token in moe architectures #450
Merged
Conversation
Force-pushed from a93a18c to b3a0ac5
Collaborator, Author
@cursor review
simlang (Member) reviewed Dec 12, 2025:
Love it, super straightforward algorithm. Commented on some higher-level stuff.
Force-pushed from e4e719b to 04ee0cf
This PR has been inactive for 10 days and is now marked as stale.
simlang approved these changes on Jan 15, 2026
src/pruna/algorithms/reduce_noe.py (Outdated)

else:
    with config_path.open("r", encoding="utf-8") as f:
        config_json = json.load(f)
target_names = smash_config["target_name"]
Member:
Super minor, but as it's only one target_name, the variable should also be called target_name, IMO.
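If adopted, the rename would be the one-liner below (a sketch of the reviewer's suggestion, not a change taken from the PR):

```python
target_name = smash_config["target_name"]
```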
src/pruna/algorithms/reduce_noe.py (Outdated)

ReduceNOE is a method to Reduce the Number Of Experts per token.
"""

algorithm_name: str = "red_noe"
Member:
Up to you if it makes sense to also change the algorithm_name to "reduce_noe". I personally prefer clearer naming, but that's really just an opinion.
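If that preference is followed, the change would be the single line below (sketch of the suggested rename, not a committed change):

```python
algorithm_name: str = "reduce_noe"
```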
Description
This PR adds a small tool that acts only on MoE models (LLMs and Hunyuan3Image for now) by reducing the number of experts triggered for each token.
All MoE models are trained with a fixed number of active experts per token, and decreasing this number alters the model's output. The idea was tested on Hunyuan3Image and gptoss_120b, but it is applicable to any MoE, e.g. Mixtral, QwenNext, etc. A minimal sketch of the mechanism follows the observations below.
- Hunyuan3Image (default is 8 out of 128 experts): 1 and 2 experts give very weird images; 4 experts seems OK and yields a 15% speedup.
- gptoss_120b (default is 4 experts): 1 and 2 experts give very weird texts, with no speedup.
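To make the mechanism concrete, here is a minimal sketch of what "reducing the number of experts per token" means on a Hugging Face MoE checkpoint. This is not the PR's ReduceNOE implementation; the model id and the `num_experts_per_tok` attribute are illustrative (the attribute name varies between architectures).

```python
# Minimal sketch, NOT the PR's ReduceNOE implementation.
# Assumes a Hugging Face MoE model whose config exposes the number of
# active experts per token as `num_experts_per_tok` (true for Mixtral;
# other architectures may use a different attribute name).
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "mistralai/Mixtral-8x7B-v0.1"  # placeholder MoE checkpoint

config = AutoConfig.from_pretrained(model_id)
print("default active experts per token:", getattr(config, "num_experts_per_tok", None))

# Route fewer experts per token: fewer expert FFNs run for each token,
# which trades output quality for speed. Mixtral's default is 2; drop to 1 here.
if hasattr(config, "num_experts_per_tok"):
    config.num_experts_per_tok = 1

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```

As the Hunyuan3Image and gptoss_120b observations above suggest, the reduced value needs to be validated per model: going too low visibly degrades outputs, and the actual speedup depends on the architecture.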
Related Issue
Fixes #(issue number)
Type of Change
How Has This Been Tested?
Checklist
Additional Notes
A notebook to test the new feature is available here.