feat(qwen3_5_moe): modeling + EP support + multi-axis DTensor merger #171
Merged
…pe override Mirrors create_model_from_pretrained — needed so users can disambiguate configs registered under multiple AutoModel mappings (e.g. Qwen3_5MoeConfig is in both causal_lm and image_text_to_text). Without the override qwen3_5_moe silently falls through to AutoModelForCausalLM and we lose the multimodal wrapper.
…guage_model.layers) Qwen3_5MoeForConditionalGeneration's .model is the multimodal wrapper Qwen3_5MoeModel (visual + language_model); decoder layers live one level deeper at .language_model.layers. Mirror qwen3_vl_moe parallelize, which has the same shape. Also wrap the vision tower.
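A minimal sketch of the layer lookup described above, assuming only the attribute names mentioned in the commit message (`.model`, `.language_model`, `.layers`). The helper name `get_decoder_layers` is hypothetical and is not taken from the PR.

```python
from types import SimpleNamespace


def get_decoder_layers(model):
    """Return decoder layers for text-only and multimodal wrappers.

    Hypothetical helper: text-only models are assumed to expose
    model.model.layers, multimodal ones model.model.language_model.layers.
    """
    inner = model.model
    # Multimodal wrapper: decoder layers live one level deeper.
    inner = getattr(inner, "language_model", inner)
    return inner.layers


# Stand-ins for the two model shapes (not real model classes).
text_only = SimpleNamespace(model=SimpleNamespace(layers=["l0", "l1"]))
multimodal = SimpleNamespace(
    model=SimpleNamespace(
        visual=object(),
        language_model=SimpleNamespace(layers=["l0", "l1", "l2"]),
    )
)
```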
…gits tuple) Upstream Qwen3_5MoeSparseMoeBlock returns Tensor; the (Tensor, router_logits) tuple from qwen3_moe's MoE block is qwen3_moe-specific and breaks the qwen3_5 text_model_forward layer loop (which doesn't unpack tuples). Drop the tuple so qwen3_5_moe can reuse text_model_forward as-is.
self.config on Qwen3_5MoeForConditionalGeneration is the multimodal Qwen3_5MoeConfig; hidden_size/vocab_size live in self.config.text_config. Fall back to text_config gracefully so both ForCausalLM (text-only) and ForConditionalGeneration (multimodal) work.
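The fallback can be sketched roughly as follows. The helper name `resolve_text_config` and the stand-in config objects are assumptions for illustration; only the `text_config`, `hidden_size`, and `vocab_size` attribute names come from the commit message.

```python
from types import SimpleNamespace


def resolve_text_config(config):
    """Return config.text_config when present, otherwise the config itself.

    Hypothetical helper: lets the same code path read hidden_size/vocab_size
    from either a text-only or a multimodal config.
    """
    text_config = getattr(config, "text_config", None)
    return config if text_config is None else text_config


# Stand-ins for the two config shapes (not the real config classes).
text_only_cfg = SimpleNamespace(hidden_size=64, vocab_size=1000)
multimodal_cfg = SimpleNamespace(
    text_config=SimpleNamespace(hidden_size=128, vocab_size=2000)
)
```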
- consolidate() now handles DTensor placements of any length: maps each rank to a multi-index on the device mesh (C-order stride) and folds shards along each axis's Shard.dim. Replicate axes take the first copy.
- For single-axis Shard placements where the mesh axis is size 1 (e.g. dp_shard_mod_ep=1 for non-expert params in an EP-only config), each rank holds a full copy; detect via local-shape == global-shape and treat as replicate to avoid 4x-wide concat artifacts.
- Avoid relying on mesh.mesh (requires initialized PG); compute stride from mesh shape assuming default row-major rank layout.
- merge() and CLI accept --model_general_type to override AutoModel class for multi-mapping configs (e.g. Qwen3_5MoeConfig is registered under both causal_lm and image_text_to_text).

Verified end-to-end on qwen3_5_moe ep_degree=4 tiny checkpoint: merged model reloads as Qwen3_5MoeForConditionalGeneration with experts.gate_up_proj fully consolidated to (num_experts, 2*I, H).
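The rank-to-multi-index mapping and per-axis folding might look roughly like the sketch below. It is a pure-Python illustration on 2-D nested lists, not the PR's implementation: the function names, the placement encoding (`None` for Replicate, an int for `Shard.dim`), and applying the local-shape == global-shape check on every axis are all assumptions made here.

```python
from math import prod


def rank_to_multi_index(rank, mesh_shape):
    """Map a flat rank to mesh coordinates, assuming row-major rank layout."""
    coords = []
    for axis, size in enumerate(mesh_shape):
        stride = prod(mesh_shape[axis + 1:])  # C-order stride of this axis
        coords.append((rank // stride) % size)
    return tuple(coords)


def concat(blocks, dim):
    """Concatenate 2-D nested lists along dim 0 (rows) or dim 1 (columns)."""
    if dim == 0:
        return [row for block in blocks for row in block]
    return [sum((b[i] for b in blocks), []) for i in range(len(blocks[0]))]


def consolidate(shards, mesh_shape, placements, global_shape):
    """Fold per-rank shards into one tensor, one mesh axis at a time.

    placements[axis] is None for Replicate or an int for Shard(dim).
    """
    current = {rank_to_multi_index(r, mesh_shape): s for r, s in enumerate(shards)}
    for axis in reversed(range(len(mesh_shape))):
        groups = {}
        for coords, block in sorted(current.items(), key=lambda kv: kv[0]):
            groups.setdefault(coords[:axis], []).append(block)
        merged = {}
        for prefix, blocks in groups.items():
            local_shape = (len(blocks[0]), len(blocks[0][0]))
            if placements[axis] is None or local_shape == tuple(global_shape):
                merged[prefix] = blocks[0]  # replicate: take the first copy
            else:
                merged[prefix] = concat(blocks, placements[axis])
        current = merged
    return current[()]


# 2x2 mesh: axis 0 shards rows (dim 0), axis 1 shards columns (dim 1).
shards = [[[0, 1]], [[2, 3]], [[4, 5]], [[6, 7]]]
full = consolidate(shards, (2, 2), [0, 1], (2, 4))
```

Folding from the last mesh axis inward keeps the grouping key a simple coordinate prefix. The sketch does not cover the case where several mesh axes shard the same tensor dimension.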
decoder_layer now returns (hidden, router_logits) tuple when requested; new text_model_forward / model_forward collect router_logits across layers and surface them on BaseModelOutputWithPastAndRmpad (which already has the field). lce_forward reads aux-loss config from text_config (matches upstream Qwen3_5MoeForConditionalGeneration which doesn't expose num_experts/router_aux_loss_coef on self). Without this, qwen3_5_moe training had no load-balancing pressure and router could collapse to a few experts over long runs.
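To illustrate the load-balancing pressure mentioned above, here is a pure-Python sketch of a Switch-Transformer-style auxiliary loss computed from router logits collected across layers. The exact formula used by the PR is not stated in the commit message, so the formula, the function names, and the input layout (a list of per-layer lists of per-token logit rows) are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's router logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits, num_experts, top_k):
    """Assumed aux loss: num_experts * sum_e(f_e * P_e).

    f_e is the fraction of tokens routed to expert e (summed over top-k
    slots); P_e is the mean router probability assigned to expert e.
    """
    rows = [row for layer in router_logits for row in layer]
    n_tokens = len(rows)
    tokens_per_expert = [0.0] * num_experts
    prob_per_expert = [0.0] * num_experts
    for row in rows:
        probs = softmax(row)
        top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:top_k]
        for e in top:
            tokens_per_expert[e] += 1.0 / n_tokens
        for e in range(num_experts):
            prob_per_expert[e] += probs[e] / n_tokens
    return num_experts * sum(f * p for f, p in zip(tokens_per_expert, prob_per_expert))


# One layer, two tokens, two experts.
balanced = load_balancing_loss([[[10.0, 0.0], [0.0, 10.0]]], num_experts=2, top_k=1)
collapsed = load_balancing_loss([[[10.0, 0.0], [10.0, 0.0]]], num_experts=2, top_k=1)
```

Under this formulation the collapsed routing yields a larger loss than the balanced one, which is the pressure that discourages the router from concentrating on a few experts.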
kcz358 added a commit that referenced this pull request on May 15, 2026
Splits aero's monkey-patch into independent 'liger' and 'rmpad' entries mirroring qwen3_5_moe (PR #171). Each entry dispatches to the right inner backbone via backbone_registry.family_{liger,rmpad}_fn. For qwen3_vl / qwen3_vl_moe / qwen3_5 (still combined-style), the registry adapts via lambda with use_rmpad=True/False + per-flag toggles. qwen3_5_moe uses its native OV2 split entries directly.
kcz358 added commits with the same message that referenced this pull request on May 18, 2026 and May 19, 2026.
Summary
lmms_engine/models/qwen3_5_moe/: OV2-style split monkey patches (`apply_liger_kernel_to_qwen3_5_moe` + `apply_rmpad_to_qwen3_5_moe`), MoE-specific forwards (decoder w/ hybrid linear/full attention, sparse-moe block w/ shared expert, stacked-experts), and `lce_forward`.
Tests
Limitations
Files
22 files changed, +1260 / -37. See commits for atomic, single-responsibility breakdown.