feat(qwen3_5_moe): modeling + EP support + multi-axis DTensor merger by kcz358 · Pull Request #171 · EvolvingLMMs-Lab/lmms-engine

kcz358 · 2026-05-15T08:36:46Z

Summary

Adds qwen3_5_moe (Qwen3.5-MoE / Qwen3.6-35B-A3B family) integration:
- lmms_engine/models/qwen3_5_moe/: OV2-style split monkey patches (`apply_liger_kernel_to_qwen3_5_moe` + `apply_rmpad_to_qwen3_5_moe`), MoE-specific forwards (decoder w/ hybrid linear/full attention, sparse-moe block w/ shared expert, stacked-experts), and `lce_forward`.
- Reuses qwen3_5 dense forwards (`attn_forward`, `linear_attn_forward`, `text_model_forward`, `model_forward`) by import — widened their `self:` type hints to `Union[dense, moe]`.
Expert Parallel:
- `lmms_engine/parallel/qwen3_5_moe/`: `Qwen3_5MoeParallelStyle` + `apply_qwen3_5_moe_parallelize_fn` (FSDP wrap branches on `decoder_layer.layer_type` for linear_attn vs self_attn).
- Registered under `MODEL_TO_PARALLEL_METHOD["qwen3_5_moe"]`.
Merger EP support (`src/lmms_engine/merger/fsdp2.py`):
- `consolidate()` handles multi-axis DTensor (FSDP2 + EP) placements by mapping each rank to a mesh multi-index and folding along each axis's `Shard.dim` / Replicate.
- Detects single-axis `Shard` with mesh_size=1 (e.g. `dp_shard_mod_ep=1` for non-expert params in an EP-only config) and treats it as `Replicate`, avoiding a spurious 4×-wide concat.
- Avoids `mesh.mesh` (requires initialized PG offline); computes stride from mesh shape assuming default row-major rank layout.
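
The rank-to-multi-index mapping described in the merger bullets can be sketched as below. This is a minimal illustration, not the code in `fsdp2.py`: the helper names `row_major_strides` and `rank_to_mesh_index` are hypothetical, and the sketch assumes the default row-major rank layout mentioned above.

```python
from math import prod


def row_major_strides(mesh_shape):
    """Stride of each mesh axis under a row-major (C-order) rank layout."""
    strides = []
    running = 1
    # Walk axes from innermost to outermost; the innermost axis has stride 1.
    for size in reversed(mesh_shape):
        strides.append(running)
        running *= size
    return tuple(reversed(strides))


def rank_to_mesh_index(rank, mesh_shape):
    """Map a flat rank to its multi-index on the mesh, without touching mesh.mesh."""
    if not 0 <= rank < prod(mesh_shape):
        raise ValueError(f"rank {rank} is outside a mesh of shape {mesh_shape}")
    strides = row_major_strides(mesh_shape)
    return tuple((rank // stride) % size for stride, size in zip(strides, mesh_shape))


if __name__ == "__main__":
    # Hypothetical 2 x 4 mesh: 8 ranks laid out row by row.
    for rank in range(8):
        print(rank, rank_to_mesh_index(rank, (2, 4)))
```

Because only the mesh shape is needed, this works offline where no process group has been initialized; it would give wrong coordinates for a mesh built with a non-default rank ordering.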
API: `create_model_from_config` + merger CLI gain `model_general_type` override so configs registered under multiple AutoModel mappings (e.g. `Qwen3_5MoeConfig` is in both `causal_lm` and `image_text_to_text`) can be disambiguated.
Drive-by: `qwen3_moe_ops.attn_forward` switched from direct `flash_attn_varlen_func` to unified `varlen_attn(backend=self.config._attn_implementation)` for SDPA portability.

Tests

New CI/CD test `test/train/qwen3_5_moe/test_qwen3_5_moe.py`: tiny random-init Qwen3_5MoeForConditionalGeneration (~100M params), ep_degree=4, 10 steps end-to-end. Verified via `bash cicd/run_traincicd.sh --model-name qwen3_5_moe --gpu-count 4`: passes in ~56s.
Existing `test_qwen3_moe` still passes (drive-by varlen_attn refactor verified compatible).
Merger regression: consolidating an existing FSDP2 non-EP checkpoint (`MidTrain7to3-QuickSFT-3_5K-OpenMM-ColdStart/checkpoint-3000`, 8 shards) yields shapes matching the model config (vocab×hidden, vision dims, attention proj dims) — no behavioral change for non-EP ckpts.
Merger EP path: consolidating the tiny ep_degree=4 ckpt produces a valid `Qwen3_5MoeForConditionalGeneration` that reloads via `AutoModelForImageTextToText.from_pretrained` with experts.gate_up_proj fully reconstituted to `(num_experts, 2*I, H)`.

Limitations

No Ulysses SP support (`sp_ulysses_degree` must be 1): the upstream qwen3_5 linear-attention path isn't SP-safe yet.
Production-scale `Qwen/Qwen3.6-35B-A3B` requires ≥ 8× A100 80G with `ep_degree=8`; smaller boxes can use lower EP.

Files

22 files changed, +1260 / -37. See commits for atomic, single-responsibility breakdown.

…pe override

Mirrors create_model_from_pretrained. Needed so users can disambiguate configs registered under multiple AutoModel mappings (e.g. Qwen3_5MoeConfig is in both causal_lm and image_text_to_text). Without the override, qwen3_5_moe silently falls through to AutoModelForCausalLM and we lose the multimodal wrapper.
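
A rough sketch of why the override matters, under stated assumptions: the `MAPPINGS` dict and `resolve_model_class` below are hypothetical stand-ins for the AutoModel mappings and the lookup inside `create_model_from_config`, and class names are plain strings for illustration only.

```python
# Hypothetical stand-in for the AutoModel mappings; insertion order models
# the order in which the mappings are tried when no override is given.
MAPPINGS = {
    "causal_lm": {"Qwen3_5MoeConfig": "Qwen3_5MoeForCausalLM"},
    "image_text_to_text": {"Qwen3_5MoeConfig": "Qwen3_5MoeForConditionalGeneration"},
}


def resolve_model_class(config_name, model_general_type=None):
    """Pick a model class name for a config, optionally forcing the mapping."""
    if model_general_type is not None:
        try:
            return MAPPINGS[model_general_type][config_name]
        except KeyError:
            raise ValueError(
                f"{config_name} is not registered under {model_general_type!r}"
            ) from None
    # No override: the first mapping that knows the config wins.
    for mapping in MAPPINGS.values():
        if config_name in mapping:
            return mapping[config_name]
    raise ValueError(f"{config_name} is not registered under any mapping")


if __name__ == "__main__":
    print(resolve_model_class("Qwen3_5MoeConfig"))
    print(resolve_model_class("Qwen3_5MoeConfig", "image_text_to_text"))
```

Without the override the first matching mapping is returned, which is how a multimodal config can end up on the text-only class without any error.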

…guage_model.layers)

Qwen3_5MoeForConditionalGeneration's .model is the multimodal wrapper Qwen3_5MoeModel (visual + language_model); decoder layers live one level deeper at .language_model.layers. Mirror qwen3_vl_moe parallelize, which has the same shape. Also wrap the vision tower.

…gits tuple)

Upstream Qwen3_5MoeSparseMoeBlock returns a Tensor; the (Tensor, router_logits) tuple from qwen3_moe's MoE block is qwen3_moe-specific and breaks the qwen3_5 text_model_forward layer loop (which doesn't unpack tuples). Drop the tuple so qwen3_5_moe can reuse text_model_forward as-is.

self.config on Qwen3_5MoeForConditionalGeneration is the multimodal Qwen3_5MoeConfig; hidden_size/vocab_size live in self.config.text_config. Fall back to text_config gracefully so both ForCausalLM (text-only) and ForConditionalGeneration (multimodal) work.
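
The fallback can be sketched in a few lines. The helper name `get_text_config` is hypothetical, and `SimpleNamespace` objects stand in for the real config classes.

```python
from types import SimpleNamespace


def get_text_config(config):
    """Return config.text_config when present, otherwise the config itself."""
    text_config = getattr(config, "text_config", None)
    return config if text_config is None else text_config


if __name__ == "__main__":
    # Stand-ins for a multimodal config (nested text_config) and a text-only one.
    multimodal = SimpleNamespace(
        text_config=SimpleNamespace(hidden_size=64, vocab_size=1000)
    )
    text_only = SimpleNamespace(hidden_size=32, vocab_size=500)
    for cfg in (multimodal, text_only):
        text_cfg = get_text_config(cfg)
        print(text_cfg.hidden_size, text_cfg.vocab_size)
```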

- consolidate() now handles DTensor placements of any length: maps each rank to a multi-index on the device mesh (C-order stride) and folds shards along each axis's Shard.dim. Replicate axes take the first copy.
- For single-axis Shard placements where the mesh axis is size 1 (e.g. dp_shard_mod_ep=1 for non-expert params in an EP-only config), each rank holds a full copy; detect via local-shape == global-shape and treat as replicate to avoid 4x-wide concat artifacts.
- Avoid relying on mesh.mesh (requires initialized PG); compute stride from mesh shape assuming default row-major rank layout.
- merge() and CLI accept --model_general_type to override AutoModel class for multi-mapping configs (e.g. Qwen3_5MoeConfig is registered under both causal_lm and image_text_to_text).

Verified end-to-end on qwen3_5_moe ep_degree=4 tiny checkpoint: merged model reloads as Qwen3_5MoeForConditionalGeneration with experts.gate_up_proj fully consolidated to (num_experts, 2*I, H).
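
The folding step can be illustrated on plain nested lists. This is a sketch under assumptions, not the merger's implementation: 2D lists stand in for tensors, an integer `d` stands in for `Shard(d)`, `None` stands in for `Replicate`, and the function signature is hypothetical. Ranks are assumed to follow the default row-major layout, which is the order `itertools.product` enumerates.

```python
from itertools import product

REPLICATE = None  # stand-in for Replicate; an int d stands in for Shard(d)


def shape(block):
    return (len(block), len(block[0]))


def concat(blocks, dim):
    """Concatenate 2D nested lists along dim 0 (rows) or dim 1 (columns)."""
    if dim == 0:
        return [list(row) for block in blocks for row in block]
    return [[x for block in blocks for x in block[r]] for r in range(len(blocks[0]))]


def consolidate(shards, mesh_shape, placements, global_shape):
    """Fold per-rank shards (indexed by rank) back into one full block."""
    # Single-axis Shard whose local shape already equals the global shape:
    # every rank holds a full copy, so treat it as replicated.
    if (len(placements) == 1 and placements[0] is not REPLICATE
            and shape(shards[0]) == tuple(global_shape)):
        return shards[0]
    # product() yields mesh coordinates in row-major order, so the i-th
    # coordinate is the multi-index of rank i.
    coords = product(*(range(n) for n in mesh_shape))
    blocks = dict(zip(coords, shards))
    # Fold the innermost mesh axis first, then work outward.
    for axis in reversed(range(len(mesh_shape))):
        folded = {}
        for prefix in product(*(range(n) for n in mesh_shape[:axis])):
            group = [blocks[prefix + (i,)] for i in range(mesh_shape[axis])]
            placement = placements[axis]
            if placement is REPLICATE:
                folded[prefix] = group[0]  # all copies agree; keep the first
            else:
                folded[prefix] = concat(group, placement)
        blocks = folded
    return blocks[()]


if __name__ == "__main__":
    # 2 x 2 mesh, rows sharded over axis 0, columns over axis 1.
    shards = [[[0, 1]], [[2, 3]], [[4, 5]], [[6, 7]]]
    print(consolidate(shards, (2, 2), (0, 1), (2, 4)))
```

Folding the innermost axis first matters when two mesh axes shard the same tensor dimension, since the outer axis splits the tensor before the inner one does.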

decoder_layer now returns a (hidden, router_logits) tuple when requested; the new text_model_forward / model_forward collect router_logits across layers and surface them on BaseModelOutputWithPastAndRmpad (which already has the field). lce_forward reads the aux-loss config from text_config (matching upstream Qwen3_5MoeForConditionalGeneration, which doesn't expose num_experts/router_aux_loss_coef on self). Without this, qwen3_5_moe training had no load-balancing pressure and the router could collapse to a few experts over long runs.
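
To make the "load-balancing pressure" concrete, here is a Switch-Transformer-style auxiliary loss in plain Python. It is a sketch only: the function name and the normalization by `top_k` are assumptions, and the loss actually computed in `lce_forward` may differ in detail. In training it would be scaled by `router_aux_loss_coef` and added to the language-modeling loss.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits, num_experts, top_k):
    """num_experts * sum_i f_i * P_i over one layer's router logits.

    router_logits: list of per-token rows, each of length num_experts.
    f_i is the fraction of routed slots sent to expert i; P_i is the mean
    router probability of expert i.
    """
    probs = [softmax(row) for row in router_logits]
    num_tokens = len(probs)
    counts = [0] * num_experts
    for p in probs:
        chosen = sorted(range(num_experts), key=lambda e: p[e], reverse=True)[:top_k]
        for e in chosen:
            counts[e] += 1
    frac_tokens = [c / (num_tokens * top_k) for c in counts]
    mean_prob = [sum(p[e] for p in probs) / num_tokens for e in range(num_experts)]
    return num_experts * sum(f * q for f, q in zip(frac_tokens, mean_prob))


if __name__ == "__main__":
    balanced = [[10.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
    collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]
    print(load_balancing_loss(balanced, 4, 1))
    print(load_balancing_loss(collapsed, 4, 1))
```

With this normalization the loss is about 1 when tokens spread evenly and approaches `num_experts` when every token picks the same expert, so minimizing it pushes the router away from collapse.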

Splits aero's monkey-patch into independent 'liger' and 'rmpad' entries mirroring qwen3_5_moe (PR #171). Each entry dispatches to the right inner backbone via backbone_registry.family_{liger,rmpad}_fn. For qwen3_vl / qwen3_vl_moe / qwen3_5 (still combined-style), the registry adapts via lambda with use_rmpad=True/False + per-flag toggles. qwen3_5_moe uses its native OV2 split entries directly.
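
The adapter idea can be sketched as follows. Only the names `family_liger_fn` and `family_rmpad_fn` come from the description above; their bodies, the `use_liger` flag, and the placeholder patch functions are hypothetical and simply return labels instead of patching anything.

```python
from typing import Callable, Dict, List


def combined_patch(use_liger: bool, use_rmpad: bool) -> List[str]:
    """Hypothetical combined-style entry: one function, flags select the patches."""
    applied = []
    if use_liger:
        applied.append("liger")
    if use_rmpad:
        applied.append("rmpad")
    return applied


def native_liger() -> List[str]:
    """Hypothetical split-style entry that only applies the liger patch."""
    return ["liger"]


def native_rmpad() -> List[str]:
    """Hypothetical split-style entry that only applies the rmpad patch."""
    return ["rmpad"]


# Combined-style families are adapted with lambdas that pin the flags;
# split-style families are registered directly.
LIGER_FNS: Dict[str, Callable[[], List[str]]] = {
    "qwen3_vl": lambda: combined_patch(use_liger=True, use_rmpad=False),
    "qwen3_5_moe": native_liger,
}
RMPAD_FNS: Dict[str, Callable[[], List[str]]] = {
    "qwen3_vl": lambda: combined_patch(use_liger=False, use_rmpad=True),
    "qwen3_5_moe": native_rmpad,
}


def family_liger_fn(family: str) -> Callable[[], List[str]]:
    return LIGER_FNS[family]


def family_rmpad_fn(family: str) -> Callable[[], List[str]]:
    return RMPAD_FNS[family]


if __name__ == "__main__":
    for family in ("qwen3_vl", "qwen3_5_moe"):
        print(family, family_liger_fn(family)(), family_rmpad_fn(family)())
```

Callers see the same two independent entries for every family, regardless of whether the underlying patch is combined or split.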

kcz358 added 17 commits May 14, 2026 23:48

refactor(qwen3_moe): use varlen_attn(backend=...) for attn_implementation portability

72fdca5

chore: ignore docs/superpowers (local plans/specs)

12063f1

feat(qwen3_5_moe): package skeleton with empty monkey patch entries

eeed7f8

feat(qwen3_5_moe): MoE-specific forwards; reuse qwen3_5 attn/linear_attn/text_model

21316ad

refactor(qwen3_5): widen self: type hints to Union[dense, moe] for shared forwards

72381c3

feat(qwen3_5_moe): liger + rmpad monkey patches (OV2-style split)

d61c416

feat(qwen3_5_moe): EP ParallelStyle (mirrors qwen3_moe with Qwen3_5MoeExperts)

4324195

feat(qwen3_5_moe): EP parallelize fn + register MODEL_TO_PARALLEL_METHOD

788d8e8

test(qwen3_5_moe): tiny-config EP smoke test (ep_degree 2/4/8)

b83f995

test(qwen3_5_moe): unittest wrapper + standardize train script argparse

cdfb637

docs(qwen3_5_moe): example yaml + run.sh + model doc with EP merger usage

1cdfd5e

kcz358 merged commit beac424 into main May 15, 2026
3 checks passed

kcz358 deleted the feat/qwen3-5-moe-ep branch May 15, 2026 09:01
