Fix MoE expert outputs not weighted by gate probabilities by MRPRESIDENT66 · Pull Request #201 · tjake/Jlama

MRPRESIDENT66 · 2026-04-16T15:11:51Z

The MoEBlock computes softmax gate probabilities for each expert but only uses them for top-k selection. The expert outputs are summed without being scaled by their gate weights, which diverges from the standard MoE formulation (result = Σ gate_weight_i × expert_i(input)).

This fix brings MoEBlock in line with the reference HuggingFace Transformers implementation, MixtralSparseMoeBlock, where each expert's output is multiplied by its routing weight before accumulation.
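To make the formula above concrete, here is a minimal sketch using plain arrays. It is not Jlama's code: the class and method names (`WeightedMoeSketch`, `plainSum`, `weightedSum`) are hypothetical, and the expert outputs are hard-coded rather than computed by real experts. It only contrasts summing expert outputs directly with summing them after scaling by a gate weight.

```java
import java.util.Arrays;

// Illustrative only: hypothetical names, not part of Jlama.
final class WeightedMoeSketch {

    /** Unweighted accumulation: expert outputs are simply added together. */
    static float[] plainSum(float[][] expertOutputs) {
        float[] result = new float[expertOutputs[0].length];
        for (float[] out : expertOutputs) {
            for (int d = 0; d < result.length; d++) {
                result[d] += out[d];
            }
        }
        return result;
    }

    /** Weighted accumulation: result = sum_i gateWeights[i] * expertOutputs[i]. */
    static float[] weightedSum(float[][] expertOutputs, float[] gateWeights) {
        float[] result = new float[expertOutputs[0].length];
        for (int i = 0; i < expertOutputs.length; i++) {
            for (int d = 0; d < result.length; d++) {
                result[d] += gateWeights[i] * expertOutputs[i][d];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two selected experts, embedding length 2 (made-up values).
        float[][] outputs = { {1f, 2f}, {3f, 4f} };
        float[] weights = {0.25f, 0.75f};
        System.out.println("plain    = " + Arrays.toString(plainSum(outputs)));
        System.out.println("weighted = " + Arrays.toString(weightedSum(outputs, weights)));
    }
}
```

With the unweighted sum, every selected expert contributes equally regardless of how confident the router was. The weighted sum lets the router's preference shape the result.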

edwardcapriolo · 2026-04-18T12:10:05Z

There are no unit tests here. I will try with Mixtral and see how it goes.

edwardcapriolo · 2026-04-18T12:30:53Z

Changes output to useless:

Was:
Edward Capriolo is a writer, director, and producer of theatrical films, television, and documentaries. He is also a professor of film and media studies at the University of North Carolina, Wilmington. He is the founder and director of the Wilmington Film Festival, a juried competition for independent filmmakers. He is also the founder and director of the Wilmington Film Commission, a non-profit organization that supports the local film industry. He>

[main] INFO io.teknek.deliverance.model.AbstractModel - Tensor provider = Native SIMD Operations, parallelSplitSize = 32 
[main] INFO io.teknek.deliverance.model.AbstractModel - Model type = Q4, Working memory type = F32, Quantized memory type = I8
 
1
0
0
0
0
0
0
0
0

 mvn test -Dtest=MixralIT

edwardcapriolo · 2026-04-18T12:42:55Z

Moving the scale BEFORE the results doesn't blow up the output.

```java
model.configurableTensorProvider.get().scale(gateWeight, moeResult, 0, model.config.embeddingLength);

// matmul the projection and sum into result
try (AbstractTensor bufq = model.maybeQuantize(buf)) {
```

But in my limited testing the result is unchanged.

Without re-normalization, the selected top-k gate weights (from a softmax over all N experts) sum to less than 1, causing expert outputs to be scaled down proportionally. This shrinks the MoE output magnitude and breaks the residual stream statistics the model was trained with. Re-normalize the selected weights to sum to 1 before applying them, matching the reference HuggingFace Mixtral implementation.

MRPRESIDENT66 · 2026-04-18T17:05:29Z

> Moving the scale BEFORE the results doesn't blow up the output.
>
> ```java
> model.configurableTensorProvider.get().scale(gateWeight, moeResult, 0, model.config.embeddingLength);
>
> // matmul the projection and sum into result
> try (AbstractTensor bufq = model.maybeQuantize(buf)) {
> ```
>
> But in my limited testing the result is unchanged.

MRPRESIDENT66 · 2026-04-18T17:06:45Z

Hi, sorry for not being able to test locally with a full Mixtral model.

Regarding your suggestion of moving scale before the matmul — I believe it has no effect because dotProductChunk uses result.set(...) which overwrites moeResult entirely, so any scaling applied before it gets lost.
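A small sketch of why scaling before an overwriting write is lost. This is an assumption-laden stand-in, not Jlama's `dotProductChunk`: `projectWithSet` is a hypothetical helper that simply assigns every slot of the result buffer, which is the "set" behavior described above.

```java
import java.util.Arrays;

// Illustrative only: hypothetical helpers, not Jlama's tensor operations.
final class ScaleOrderSketch {

    /** Scales a buffer in place. */
    static void scale(float factor, float[] buffer) {
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] *= factor;
        }
    }

    /** Stand-in projection with "set" semantics: every slot of result is overwritten. */
    static void projectWithSet(float[] input, float[] result) {
        for (int i = 0; i < result.length; i++) {
            result[i] = 2f * input[i];
        }
    }

    public static void main(String[] args) {
        float[] input = {3f, 4f};

        // Scale first, then project: the scaling is overwritten.
        float[] scaledBefore = {1f, 1f};
        scale(0.5f, scaledBefore);
        projectWithSet(input, scaledBefore);

        // Project first, then scale: the scaling survives.
        float[] scaledAfter = {1f, 1f};
        projectWithSet(input, scaledAfter);
        scale(0.5f, scaledAfter);

        System.out.println("scale before = " + Arrays.toString(scaledBefore));
        System.out.println("scale after  = " + Arrays.toString(scaledAfter));
    }
}
```

If the projection accumulated into the buffer instead of assigning to it, the order would matter in a different way, so this sketch only covers the overwrite case.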

I also realized my first version of this fix was incomplete. Multiplying by the raw softmax probabilities without re-normalizing causes the output to shrink, because the top-k weights (e.g. top-2 out of 8 experts) sum to less than 1. This is why the output broke in your test.

The updated fix adds a re-normalization step before applying the weights, ensuring the selected top-k weights sum to 1.
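The re-normalization step can be sketched roughly as follows, assuming the gate probabilities are already a softmax over all experts. The names (`TopKRenormSketch`, `topK`, `renormalize`) and the probability values are hypothetical and chosen for illustration. This is not the code in the PR.

```java
import java.util.Arrays;

// Illustrative only: hypothetical names and made-up probabilities.
final class TopKRenormSketch {

    /** Returns the indices of the k largest probabilities, largest first. */
    static int[] topK(float[] probs, int k) {
        int[] selected = new int[k];
        boolean[] used = new boolean[probs.length];
        for (int j = 0; j < k; j++) {
            int best = -1;
            for (int i = 0; i < probs.length; i++) {
                if (!used[i] && (best < 0 || probs[i] > probs[best])) {
                    best = i;
                }
            }
            used[best] = true;
            selected[j] = best;
        }
        return selected;
    }

    /** Divides each selected probability by their sum so the weights add up to 1. */
    static float[] renormalize(float[] probs, int[] selected) {
        float sum = 0f;
        for (int e : selected) {
            sum += probs[e];
        }
        float[] weights = new float[selected.length];
        for (int j = 0; j < selected.length; j++) {
            weights[j] = probs[selected[j]] / sum;
        }
        return weights;
    }

    public static void main(String[] args) {
        // Softmax output over 8 experts (sums to 1 across all experts).
        float[] probs = {0.4f, 0.1f, 0.2f, 0.05f, 0.05f, 0.1f, 0.05f, 0.05f};
        int[] selected = topK(probs, 2);
        float[] weights = renormalize(probs, selected);
        System.out.println("selected experts   = " + Arrays.toString(selected));
        System.out.println("normalized weights = " + Arrays.toString(weights));
    }
}
```

In this example the two selected raw probabilities add up to only about 0.6, so applying them directly would shrink the combined output by roughly 40 percent. Dividing by their sum restores a total weight of 1.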

edwardcapriolo · 2026-04-18T17:11:57Z

Send the PR here: edwardcapriolo/deliverance#87

I don't have merge ability for this project, and it is not being maintained ATM.

The MixtureOfExpertsBlock computes softmax gate probabilities but only uses them for top-k selection. Expert outputs are summed without being scaled by their gate weights. This fix re-normalizes the selected top-k weights to sum to 1, then scales each expert output by its normalized gate weight before accumulation, matching the HuggingFace Mixtral reference implementation. Fixes: edwardcapriolo#87 See also: tjake/Jlama#201

* Fix MoE expert outputs not weighted by gate probabilities

  The MixtureOfExpertsBlock computes softmax gate probabilities but only uses them for top-k selection. Expert outputs are summed without being scaled by their gate weights. This fix re-normalizes the selected top-k weights to sum to 1, then scales each expert output by its normalized gate weight before accumulation, matching the HuggingFace Mixtral reference implementation. Fixes: #87 See also: tjake/Jlama#201
* Fix along with pr from mr-pres

---------

Co-authored-by: MRPRESIDENT66 <jinmingyijack@163.com>

* Fix MoE expert outputs not weighted by gate probabilities

  The MixtureOfExpertsBlock computes softmax gate probabilities but only uses them for top-k selection. Expert outputs are summed without being scaled by their gate weights. This fix re-normalizes the selected top-k weights to sum to 1, then scales each expert output by its normalized gate weight before accumulation, matching the HuggingFace Mixtral reference implementation. Fixes: #87 See also: tjake/Jlama#201
* Fix along with pr from mr-pres
* changes for removal

---------

Co-authored-by: MRPRESIDENT66 <jinmingyijack@163.com>

edwardcapriolo mentioned this pull request Apr 18, 2026

moe not scaled edwardcapriolo/deliverance#87

Closed

MRPRESIDENT66 closed this Apr 18, 2026

MRPRESIDENT66 reopened this Apr 18, 2026

MRPRESIDENT66 mentioned this pull request Apr 18, 2026

Fix MoE expert outputs not weighted by gate probabilities edwardcapriolo/deliverance#88

Closed

MRPRESIDENT66 wants to merge 2 commits into tjake:main from MRPRESIDENT66:fix/moe-gate-weight-scaling

2 participants