feat(qwen3_5_moe): modeling + EP support + multi-axis DTensor merger #171

Merged
kcz358 merged 17 commits into main from feat/qwen3-5-moe-ep
May 15, 2026

kcz358 (Collaborator) commented May 15, 2026

Summary

  • Adds qwen3_5_moe (Qwen3.5-MoE / Qwen3.6-35B-A3B family) integration:
    • lmms_engine/models/qwen3_5_moe/: OV2-style split monkey patches (`apply_liger_kernel_to_qwen3_5_moe` + `apply_rmpad_to_qwen3_5_moe`), MoE-specific forwards (decoder w/ hybrid linear/full attention, sparse-moe block w/ shared expert, stacked-experts), and `lce_forward`.
    • Reuses qwen3_5 dense forwards (`attn_forward`, `linear_attn_forward`, `text_model_forward`, `model_forward`) by import — widened their `self:` type hints to `Union[dense, moe]`.
  • Expert Parallel:
    • `lmms_engine/parallel/qwen3_5_moe/`: `Qwen3_5MoeParallelStyle` + `apply_qwen3_5_moe_parallelize_fn` (FSDP wrap branches on `decoder_layer.layer_type` for linear_attn vs self_attn).
    • Registered under `MODEL_TO_PARALLEL_METHOD["qwen3_5_moe"]`.
  • Merger EP support (`src/lmms_engine/merger/fsdp2.py`):
    • `consolidate()` handles multi-axis DTensor (FSDP2 + EP) placements by mapping each rank to a mesh multi-index and folding along each axis's `Shard.dim`; `Replicate` axes take the first copy.
    • Detects a single-axis `Shard` whose mesh axis has size 1 (e.g. `dp_shard_mod_ep=1` for non-expert params in an EP-only config), where every rank already holds a full copy, and treats it as replicated to avoid a spurious 4×-wide concat.
    • Avoids `mesh.mesh`, which requires an initialized process group that is not available when merging offline; computes strides from the mesh shape, assuming the default row-major rank layout.
  • API: `create_model_from_config` + merger CLI gain `model_general_type` override so configs registered under multiple AutoModel mappings (e.g. `Qwen3_5MoeConfig` is in both `causal_lm` and `image_text_to_text`) can be disambiguated.
  • Drive-by: `qwen3_moe_ops.attn_forward` switched from direct `flash_attn_varlen_func` to unified `varlen_attn(backend=self.config._attn_implementation)` for SDPA portability.
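
The multi-axis folding in the merger bullets can be sketched as follows. This is a minimal pure-Python illustration, not the code in `fsdp2.py`: the names `rank_to_multi_index`, `concat`, and `fold_shards` are hypothetical, nested lists stand in for tensors, and a placement is encoded as an integer shard dim or `None` for Replicate. It assumes evenly sized shards and the default row-major rank layout.

```python
from itertools import product


def rank_to_multi_index(rank, mesh_shape):
    """Unravel a flat rank into mesh coordinates using row-major (C-order) strides."""
    index = []
    for size in reversed(mesh_shape):
        index.append(rank % size)
        rank //= size
    return tuple(reversed(index))


def concat(blocks, dim):
    """Concatenate nested-list stand-ins for tensors along dimension `dim`."""
    if dim == 0:
        return [row for block in blocks for row in block]
    return [concat([block[i] for block in blocks], dim - 1)
            for i in range(len(blocks[0]))]


def fold_shards(shards_by_rank, mesh_shape, placements):
    """Fold per-rank shards into one tensor.

    placements[axis] is the tensor dim sharded along that mesh axis,
    or None when the axis is replicated (hypothetical encoding).
    """
    by_index = {rank_to_multi_index(rank, mesh_shape): shard
                for rank, shard in shards_by_rank.items()}
    # Fold the innermost mesh axis first, so that two axes sharding the
    # same tensor dim are undone in reverse order of application.
    for axis in reversed(range(len(mesh_shape))):
        folded = {}
        for prefix in product(*(range(size) for size in mesh_shape[:axis])):
            group = [by_index[prefix + (i,)] for i in range(mesh_shape[axis])]
            dim = placements[axis]
            # Replicate: every copy is identical, so keep the first one.
            folded[prefix] = group[0] if dim is None else concat(group, dim)
        by_index = folded
    return by_index[()]
```

For example, a 4×4 matrix sharded over a 2×2 mesh with placements `(0, 1)` gives each rank a 2×2 block, and `fold_shards` reassembles the original matrix from those four blocks.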

Tests

  • New CI/CD test `test/train/qwen3_5_moe/test_qwen3_5_moe.py`: tiny random-init Qwen3_5MoeForConditionalGeneration (~100M params), ep_degree=4, 10 steps end-to-end. Verified via `bash cicd/run_traincicd.sh --model-name qwen3_5_moe --gpu-count 4`: passes in ~56s.
  • Existing `test_qwen3_moe` still passes (drive-by varlen_attn refactor verified compatible).
  • Merger regression: consolidating an existing FSDP2 non-EP checkpoint (`MidTrain7to3-QuickSFT-3_5K-OpenMM-ColdStart/checkpoint-3000`, 8 shards) yields shapes matching the model config (vocab×hidden, vision dims, attention proj dims) — no behavioral change for non-EP ckpts.
  • Merger EP path: consolidating the tiny ep_degree=4 ckpt produces a valid `Qwen3_5MoeForConditionalGeneration` that reloads via `AutoModelForImageTextToText.from_pretrained` with experts.gate_up_proj fully reconstituted to `(num_experts, 2*I, H)`.

Limitations

  • No Ulysses SP support (`sp_ulysses_degree` must be 1): the upstream qwen3_5 linear-attention path isn't SP-safe yet.
  • Production-scale `Qwen/Qwen3.6-35B-A3B` requires ≥ 8× A100 80G with `ep_degree=8`; smaller boxes can use lower EP.

Files

22 files changed, +1260 / -37. See commits for atomic, single-responsibility breakdown.

kcz358 added 17 commits May 14, 2026 23:48
…pe override

Mirrors create_model_from_pretrained — needed so users can disambiguate
configs registered under multiple AutoModel mappings (e.g. Qwen3_5MoeConfig
is in both causal_lm and image_text_to_text). Without the override
qwen3_5_moe silently falls through to AutoModelForCausalLM and we lose the
multimodal wrapper.
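
A rough sketch of the disambiguation this override enables is shown below. The registry dicts and `resolve_auto_class` are hypothetical stand-ins, not the actual `create_model_from_config` code; class names are kept as strings so the snippet runs without `transformers`, and the default order is an assumption based on the fall-through described above.

```python
from typing import Optional

# Hypothetical registry: the AutoModel mappings each config class is
# registered under, in the order a default lookup would try them.
CONFIG_TO_GENERAL_TYPES = {
    "Qwen3_5MoeConfig": ["causal_lm", "image_text_to_text"],
}

# Hypothetical mapping from general type to AutoModel class name.
GENERAL_TYPE_TO_AUTO_CLASS = {
    "causal_lm": "AutoModelForCausalLM",
    "image_text_to_text": "AutoModelForImageTextToText",
}


def resolve_auto_class(config_name: str,
                       model_general_type: Optional[str] = None) -> str:
    """Pick an AutoModel class name, honoring an explicit override if given."""
    candidates = CONFIG_TO_GENERAL_TYPES[config_name]
    if model_general_type is None:
        chosen = candidates[0]  # default: first mapping that matches
    elif model_general_type in candidates:
        chosen = model_general_type
    else:
        raise ValueError(
            f"{config_name} is not registered under {model_general_type!r}"
        )
    return GENERAL_TYPE_TO_AUTO_CLASS[chosen]
```

Without the override the lookup lands on the causal-LM class; passing `model_general_type="image_text_to_text"` selects the multimodal one.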
…guage_model.layers)

Qwen3_5MoeForConditionalGeneration's .model is the multimodal wrapper
Qwen3_5MoeModel (visual + language_model); decoder layers live one level
deeper at .language_model.layers. Mirror qwen3_vl_moe parallelize, which
has the same shape. Also wrap the vision tower.
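
The module layout described here can be illustrated with stand-in objects. `get_decoder_layers` is a hypothetical helper, not a function from the parallelize code, and the `SimpleNamespace` objects only mimic the attribute nesting.

```python
from types import SimpleNamespace


def get_decoder_layers(model):
    """Return decoder layers whether .model is a multimodal wrapper or a text model."""
    inner = model.model
    # Multimodal wrappers keep the decoder one level deeper, under .language_model.
    text_model = getattr(inner, "language_model", inner)
    return text_model.layers


# Stand-ins for ForConditionalGeneration (multimodal) and ForCausalLM (text-only).
multimodal = SimpleNamespace(
    model=SimpleNamespace(
        visual=object(),
        language_model=SimpleNamespace(layers=["layer0", "layer1"]),
    )
)
text_only = SimpleNamespace(model=SimpleNamespace(layers=["layer0"]))
```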
…gits tuple)

Upstream Qwen3_5MoeSparseMoeBlock returns Tensor; the (Tensor, router_logits)
tuple from qwen3_moe's MoE block is qwen3_moe-specific and breaks the
qwen3_5 text_model_forward layer loop (which doesn't unpack tuples). Drop
the tuple so qwen3_5_moe can reuse text_model_forward as-is.

self.config on Qwen3_5MoeForConditionalGeneration is the multimodal
Qwen3_5MoeConfig; hidden_size/vocab_size live in self.config.text_config.
Fall back to text_config gracefully so both ForCausalLM (text-only) and
ForConditionalGeneration (multimodal) work.
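
One way to express this fallback is a single `getattr` with the config itself as the default. The helper name `resolve_text_config` is hypothetical, and the `SimpleNamespace` objects below are stand-ins for the real config classes.

```python
from types import SimpleNamespace


def resolve_text_config(config):
    """Use config.text_config when present (multimodal), else the config itself."""
    return getattr(config, "text_config", config)


# Stand-in configs: multimodal nests the text fields, text-only holds them directly.
multimodal_cfg = SimpleNamespace(
    text_config=SimpleNamespace(hidden_size=64, vocab_size=1000)
)
text_cfg = SimpleNamespace(hidden_size=64, vocab_size=1000)

hidden_sizes = [
    resolve_text_config(cfg).hidden_size for cfg in (multimodal_cfg, text_cfg)
]
```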

- consolidate() now handles DTensor placements of any length: maps each
  rank to a multi-index on the device mesh (C-order stride) and folds
  shards along each axis's Shard.dim. Replicate axes take the first copy.
- For single-axis Shard placements where the mesh axis is size 1 (e.g.
  dp_shard_mod_ep=1 for non-expert params in an EP-only config), each
  rank holds a full copy; detect via local-shape == global-shape and
  treat as replicate to avoid 4x-wide concat artifacts.
- Avoid relying on mesh.mesh (requires initialized PG); compute stride
  from mesh shape assuming default row-major rank layout.
- merge() and CLI accept --model_general_type to override AutoModel
  class for multi-mapping configs (e.g. Qwen3_5MoeConfig is registered
  under both causal_lm and image_text_to_text).

Verified end-to-end on qwen3_5_moe ep_degree=4 tiny checkpoint:
merged model reloads as Qwen3_5MoeForConditionalGeneration with
experts.gate_up_proj fully consolidated to (num_experts, 2*I, H).
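
The size-1 mesh axis case can be illustrated on shapes alone. `consolidated_shape` is a hypothetical helper written for this sketch; it only shows why comparing local and global shapes avoids the wide concat, and assumes a single shard dim.

```python
def consolidated_shape(local_shapes, global_shape, shard_dim):
    """Shape obtained by merging per-rank pieces of one parameter."""
    global_shape = tuple(global_shape)
    # If every rank already holds the full tensor, the nominal Shard is
    # effectively a Replicate: keep one copy instead of concatenating.
    if all(tuple(shape) == global_shape for shape in local_shapes):
        return global_shape
    merged = list(local_shapes[0])
    merged[shard_dim] = sum(shape[shard_dim] for shape in local_shapes)
    return tuple(merged)


def naive_concat_shape(local_shapes, shard_dim):
    """What a blind concat along shard_dim would produce."""
    merged = list(local_shapes[0])
    merged[shard_dim] = sum(shape[shard_dim] for shape in local_shapes)
    return tuple(merged)
```

With four ranks each holding a full `(8, 16)` parameter, the naive concat yields `(32, 16)`, which is the 4x-wide artifact, while the shape check returns `(8, 16)`.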

decoder_layer now returns (hidden, router_logits) tuple when requested;
new text_model_forward / model_forward collect router_logits across
layers and surface them on BaseModelOutputWithPastAndRmpad (which
already has the field). lce_forward reads aux-loss config from
text_config (matches upstream Qwen3_5MoeForConditionalGeneration
which doesn't expose num_experts/router_aux_loss_coef on self).

Without this, qwen3_5_moe training had no load-balancing pressure
and the router could collapse to a few experts over long runs.
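
To make the load-balancing pressure concrete, here is a sketch of a common Switch-Transformer-style auxiliary loss computed from router logits. The exact formula used by `lce_forward` is not stated in this PR, so this is an assumption; `load_balancing_loss` is a hypothetical pure-Python helper, and in training the result would typically be scaled by `router_aux_loss_coef`.

```python
import math


def load_balancing_loss(router_logits, num_experts, top_k):
    """Auxiliary loss that is small when tokens spread evenly over experts.

    router_logits: one list of num_experts logits per token
    (router outputs from all layers concatenated).
    """
    num_tokens = len(router_logits)
    tokens_per_expert = [0.0] * num_experts  # fraction of tokens routed to each expert
    prob_per_expert = [0.0] * num_experts    # mean router probability per expert
    for logits in router_logits:
        # Numerically stable softmax over experts.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Top-k experts selected for this token.
        chosen = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:top_k]
        for expert in chosen:
            tokens_per_expert[expert] += 1.0 / num_tokens
        for expert in range(num_experts):
            prob_per_expert[expert] += probs[expert] / num_tokens
    return num_experts * sum(
        f * p for f, p in zip(tokens_per_expert, prob_per_expert)
    )


# Four tokens, four experts, top-1 routing.
balanced = [[5.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
collapsed = [[5.0, 0.0, 0.0, 0.0] for _ in range(4)]
balanced_loss = load_balancing_loss(balanced, num_experts=4, top_k=1)
collapsed_loss = load_balancing_loss(collapsed, num_experts=4, top_k=1)
```

The collapsed routing, where every token prefers expert 0, gives a larger loss than the balanced one, so minimizing this term pushes the router away from the collapse described above.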
kcz358 merged commit beac424 into main May 15, 2026
3 checks passed
kcz358 deleted the feat/qwen3-5-moe-ep branch May 15, 2026 09:01
kcz358 added a commit that referenced this pull request May 15, 2026
Splits aero's monkey-patch into independent 'liger' and 'rmpad' entries
mirroring qwen3_5_moe (PR #171). Each entry dispatches to the right
inner backbone via backbone_registry.family_{liger,rmpad}_fn.

For qwen3_vl / qwen3_vl_moe / qwen3_5 (still combined-style), the
registry adapts via lambda with use_rmpad=True/False + per-flag toggles.
qwen3_5_moe uses its native OV2 split entries directly.
kcz358 added a commit that referenced this pull request May 18, 2026
kcz358 added a commit that referenced this pull request May 19, 2026
1 participant