
Add Qwen3.5 models #1500

Open
YunchaoYang wants to merge 25 commits into main from yy/qwen35

Conversation


YunchaoYang (Contributor) commented on Mar 25, 2026

Adds full Qwen 3.5 model family support to fairseq2: dense variants (0.8B, 2B, 9B, 27B), including model architecture, tokenizer integration, HuggingFace interoperation, and a validated SFT training recipe.

Architecture

1. Hybrid Attention

Qwen 3.5 is a hybrid Linear + Full Attention model that alternates two layer types:

  • Gated Linear Attention (75%): Gated DeltaNet — causal convolution + gated delta rule recurrence. Constant memory during decoding (no KV cache growth).
  • Gated Full Attention (25%): Standard multi-head attention with output gating, partial rotary embeddings (64/256 dims), and QK-norm.

Every 4th layer is full attention; the rest are linear attention.
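
To make the hybrid layout and the constant-memory claim concrete, here is a minimal sketch in plain PyTorch, not fairseq2's actual code. It assumes the common gated delta rule form S ← α·(I − β·k·kᵀ)·S + β·k·vᵀ with output o = Sᵀq; the exact parameterization in this PR's GatedDeltaNet may differ.

```python
import torch

# Hybrid layout: every 4th layer is full attention, the rest linear.
def layer_types(num_layers: int) -> list[str]:
    return [
        "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
        for i in range(num_layers)
    ]

def gated_delta_step(S, q, k, v, alpha, beta):
    """One decode step of a gated delta rule (single head, illustrative).

    S: (d_k, d_v) recurrent state; q, k: (d_k,); v: (d_v,);
    alpha, beta: scalars in (0, 1). The state shape is fixed, so decoding
    uses constant memory instead of a growing KV cache.
    """
    # S <- alpha * (I - beta * k k^T) S + beta * k v^T
    S = alpha * (S - beta * torch.outer(k, k @ S)) + beta * torch.outer(k, v)
    return S, S.T @ q  # new state, attention output of shape (d_v,)
```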

2. (1+weight) RMSNorm

All RMSNorms use norm(x) * (1 + weight) with weights initialized to zeros, instead of the standard norm(x) * weight with weights initialized to ones. Mathematically equivalent at init, but different parameter space — this required special handling in the state dict converter (weight += 1.0 on import, weight -= 1.0 on export).
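
A compact sketch of the two parameterizations and the weight shift (illustrative; the real converter operates on state dict keys):

```python
import torch

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

def qwen35_norm(x, w):    # Qwen 3.5 / HF: scale is (1 + w), w init to zeros
    return rms_norm(x) * (1.0 + w)

def standard_norm(x, w):  # standard: scale is w, w init to ones
    return rms_norm(x) * w

# The converter's weight shift makes the two numerically identical:
w_hf = torch.randn(8)
x = torch.randn(2, 8)
assert torch.allclose(qwen35_norm(x, w_hf), standard_norm(x, w_hf + 1.0))
# import (HF -> fairseq2): weight += 1.0; export: weight -= 1.0
```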

3. Partial Rotary Embeddings

Only 25% of head dimensions get rotary position encoding (64 of 256). The other 192 dims are position-independent, acting as retrieval-like features. Most other models rotate all dims.
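
A sketch of the partial rotation, assuming the half-split rotation convention (the interleaving in the actual implementation may differ):

```python
import torch

def apply_partial_rope(q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rotary_dim: int = 64) -> torch.Tensor:
    """Rotate only the first `rotary_dim` dims of each 256-dim head.

    q: (..., head_dim); cos, sin: (..., rotary_dim // 2).
    """
    q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]
    x1, x2 = q_rot.chunk(2, dim=-1)
    q_rot = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    # The remaining 192 dims pass through position-independent.
    return torch.cat((q_rot, q_pass), dim=-1)
```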

MoE Support (configs and inference only)

  • moe.py: Qwen35TopKRouter (softmax → top-k → renormalize; sketched below), Qwen35Experts (3D-parameter experts), Qwen35MoeBlock (drop-in FFN replacement with a shared expert)
  • Qwen35MoeConfig(Qwen35Config) + Qwen35MoeFactory(Qwen35Factory)
  • _Qwen35MoeHuggingFaceConverter and asset cards for 35B-A3B

Note: MoE variant configs are available for inference, but MoE training/SFT is not yet supported because fairseq2 lacks MoE parallelization support.
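
A minimal sketch of the routing math named above, folding in the later commit that keeps raw logits separate from the renormalized weights; illustrative, not the actual Qwen35TopKRouter:

```python
import torch

def route(router_logits: torch.Tensor, top_k: int):
    """router_logits: (num_tokens, num_experts)."""
    probs = router_logits.softmax(dim=-1)                  # softmax, not sigmoid (unlike Llama4)
    weights, expert_ids = probs.topk(top_k, dim=-1)        # top-k on post-softmax probs
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize: weights sum to 1
    return router_logits, weights, expert_ids              # raw logits first, for a future load-balancing loss
```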

SFT Training Recipe

qwen35_0.8b_gsm8k.yaml — SFT config on GSM8K with FSDP, bfloat16, cosine annealing

Tests:

  1. 29 component unit tests passed;
  2. Parity test passed (logit diff < 1e-4): validated inference against HuggingFace Transformers on the Qwen/Qwen3.5-0.8B model;
  3. Dense SFT recipe validated end-to-end on an 8-GPU SLURM node

Fixes #1497

Does your PR introduce any breaking changes? If yes, please list them:

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

YunchaoYang marked this pull request as ready for review on March 31, 2026.
… Qwen35DecoderLayer

- gated_delta_net.py: GatedDeltaNet module with IncrementalState, RMSNormGated,
  and PyTorch fallback kernels (chunk/recurrent delta rule, causal conv1d)
- attention.py: Qwen35Attention with doubled Q projection, output gating,
  partial RoPE (64/256 dims), QK-norm, GQA
- decoder_layer.py: Qwen35DecoderLayer with hybrid dispatch
  (full_attention → Qwen35Attention, linear_attention → GatedDeltaNet)
- moe.py: Qwen35TopKRouter (softmax + top-k + renormalize),
  Qwen35Experts (3D param fused experts with SiLU-gated MLP),
  Qwen35MoeBlock(FeedForwardNetwork) drop-in FFN replacement
  with shared expert + sigmoid gate blending
- Router uses softmax (NOT sigmoid like Llama4)
- Expert weights stored as 3D Parameters for per-expert indexing
- test_gated_delta_net.py (6 tests): forward shape, prefill parity,
  step-by-step decode, chunked vs recurrent consistency, state reorder,
  RMSNormGated correctness
- test_qwen35_attention.py (6 tests): forward shape, output gating,
  partial RoPE, GQA, QK-norm effect, incremental KV cache decode
- test_qwen35_interop.py (4 tests): state dict key round-trip,
  RMSNorm weight+1 conversion, GDN norm NOT converted, layer types
- test_qwen35_decoder_layer.py (5 tests): full/linear attention layers,
  invalid type error, factory e2e model creation, hybrid layer pattern
- test_qwen35_moe.py (8 tests): router shapes/softmax/renorm,
  expert shapes/weighted output, MoE block shape, shared expert,
  FeedForwardNetwork inheritance
- qwen35_architecture.md: Theory & HF reference analysis (hybrid layers,
  partial RoPE, (1+w) RMSNorm, GatedDeltaNet, MoE, dual cache)
- qwen35_implementation_plan.md: 5-phase implementation plan with code sketches
- qwen35_key_decisions.md: 6 design decisions with alternatives, tradeoffs,
  FSDP/TP analysis, and OLMo2 RMSNorm comparison
- qwen35_progress.md: Implementation tracker (Phases 1-4 complete)
- Fix SDPA call: pass q_layout/k_layout positional args, unpack tuple return
- Fix parent constructor: MultiheadAttention.__init__() takes no arguments
- Fix import: repeat_interleave from fairseq2.ops (not fairseq2.nn.functional)
- Fix incremental KV cache test: use CausalAttentionBias (not IdentityBias)
- Preserve raw logits in router_logits, apply softmax to separate router_probs
- Top-k selection now uses router_probs (post-softmax), not raw logits
- Return tuple's first element is now raw pre-softmax logits for future
  load-balancing loss support
- Update test: verify logits do NOT sum to 1 and weights DO sum to 1
- Add qwen35_0.8b arch config matching HF Qwen3.5-0.8B exactly (see the dataclass sketch below):
  model_dim=1024, num_layers=24, ffn_inner_dim=3584, rope_theta=10M,
  linear_num_key_heads=16, linear_num_value_heads=16, tied_embeddings=True
- Add asset cards for qwen35_0.8b, qwen35_27b, qwen35_moe_35b_a3b
  with HuggingFace checkpoint/tokenizer URIs
- Add TODO comments after Qwen 3.5 dense and MoE family registration
- Tracks that HF export support requires reverse RMSNorm conversion (weight -= 1.0)
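
The qwen35_0.8b values above, restated as a plain dataclass for readability; field names are copied from the commit message and are not verified against fairseq2's actual Qwen35Config:

```python
from dataclasses import dataclass

@dataclass
class Qwen35_08BArch:
    model_dim: int = 1024
    num_layers: int = 24
    ffn_inner_dim: int = 3584
    rope_theta: float = 10_000_000.0
    linear_num_key_heads: int = 16
    linear_num_value_heads: int = 16
    tied_embeddings: bool = True
```
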
Root cause of HF parity failure: Qwen3_5RMSNorm uses (1+weight) formula for
ALL norms, including q_norm and k_norm in attention layers. Only the layer
norms (input_layernorm, post_attention_layernorm, model.norm) were being
converted with weight += 1.0 — q_norm and k_norm were missed.

Without conversion, q_norm computes norm(x) * 0.43 instead of norm(x) * 1.43,
a 3.3x scaling error that compounds through all 6 full-attention layers.

- Add 'self_attn.q_norm.weight' and 'self_attn.k_norm.weight' to
  _QWEN35_RMSNORM_KEYS
- Fix interop test: reset layer_types=None before __post_init__ regeneration
- Add end-to-end parity test: loads Qwen3.5-0.8B from HuggingFace, converts
  state dict, runs inference, asserts logit closeness (atol=1e-4)
- Add parity investigation writeup documenting the debugging methodology:
  layer-by-layer comparison, M-RoPE analysis (confirmed as no-op for text),
  sub-operation comparison isolating q_norm/k_norm as root cause
- Test result: max abs diff 7.63e-06, cosine sim 1.0, top-5 token match
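
A hedged sketch of the parity check described above; `candidate_model` stands in for the converted fairseq2 model, and the tolerance matches the atol=1e-4 used in the test:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def check_parity(candidate_model) -> None:
    """Compare candidate logits against the HF reference (needs a
    transformers version that supports Qwen3.5, per the skipif guard).

    `candidate_model`: any callable mapping token ids (1, S) -> logits (1, S, V).
    """
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")
    ref_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B")
    ids = tok("The gated delta rule", return_tensors="pt").input_ids
    with torch.no_grad():
        ref = ref_model(ids).logits
        out = candidate_model(ids)
    torch.testing.assert_close(out, ref, atol=1e-4, rtol=0)
```
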
…ress

Dense archs (from /checkpoint/smallomnillm/shared/models/):
  - qwen35_0.8b: h=1024 L=24 H=8 kv=2 ffn=3584 tied=True lvh=16
  - qwen35_2b:   h=2048 L=24 H=8 kv=2 ffn=6144 tied=True lvh=16
  - qwen35_9b:   h=4096 L=32 H=16 kv=4 ffn=12288 tied=False lvh=32
  - qwen35_27b:  h=5120 L=64 H=24 kv=4 ffn=17408 tied=False lvh=48
MoE arch:
  - qwen35_moe_35b_a3b: h=2048 L=40 H=16 kv=2 E=256 K=8

Asset cards: 8 entries (0.8b, 2b, 2b_base, 9b, 9b_base, 27b, moe_35b_a3b, moe_35b_a3b_base)
Progress doc: updated through Phase 5 with all configs and parity results
…path kernels, sharder

- Add _Qwen35HuggingFaceConverter and _Qwen35MoeHuggingFaceConverter (interop.py)
- Register HF converters in composition/models.py for qwen3_5 and qwen3_5_moe families
- Add get_qwen35_moe_model_hub and get_qwen35_moe_tokenizer_hub (hub.py)
- Add conditional imports for causal_conv1d and fla fast-path kernels in gated_delta_net.py (see the sketch below)
- Add get_qwen35_shard_specs() for deprecated-style FSDP sharding (sharder.py)
- Export all new symbols from __init__.py
- Add 6 unit tests for HF converters (config mapping, state dict round-trip, RMSNorm reversal, tied embeddings)
- Fix MoE key map: mlp.gate -> ffn.gate (matching actual Qwen35MoeBlock attribute name)
- Update progress doc to mark Phase 6 complete

35/35 unit tests passing.
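
A common shape for the conditional fast-path import described above (a sketch; the real gated_delta_net.py may structure this differently). `causal_conv1d_fn` is the kernel from the optional causal-conv1d package; the PyTorch fallback here is illustrative:

```python
import torch
import torch.nn.functional as F

try:
    from causal_conv1d import causal_conv1d_fn  # optional fused CUDA kernel
except ImportError:
    causal_conv1d_fn = None

def _torch_causal_conv1d(x, weight, bias=None):
    # Fallback: depthwise causal conv. x: (B, D, S); weight: (D, W).
    d, w = weight.shape
    x = F.pad(x, (w - 1, 0))  # left-pad so output at t sees only steps <= t
    return F.conv1d(x, weight.unsqueeze(1), bias, groups=d)

def causal_conv(x, weight, bias=None):
    if causal_conv1d_fn is not None and x.is_cuda:
        return causal_conv1d_fn(x, weight, bias)  # fast path
    return _torch_causal_conv1d(x, weight, bias)  # PyTorch fallback
```
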
- Add SFT YAML config (qwen35_0.8b_gsm8k.yaml) with model, tokenizer,
  dataset, optimizer, and scheduler settings
- Add convenience run script (run_qwen35_gsm8k.sh)
- Add pad_idx field to QwenConfig and Qwen35Config for SFT compatibility
- Wire pad_idx through Qwen35Factory to embedding and TransformerLM
- Set tokenizer use_im_end=true so eos_idx != pad_idx (required by
  VocabularyInfo)
…n failure

- Fix 10 mypy errors across 4 files:
  - gated_delta_net.py: remove unused import, fix None->Tensor assignment
  - moe.py: add explicit Tensor type annotation (no-any-return)
  - factory.py: narrow return types to Qwen35Attention/GatedDeltaNet
  - test_gated_delta_net.py: add assert-not-None guards
- Add mypy override for optional deps (causal_conv1d, fla) in pyproject.toml
- Move parity test to tests/integration/models/test_qwen35.py with
  pytest.mark.skipif guard for unsupported transformers versions
- Run isort + black formatting on all changed files
…ttention.py

- Remove tests/parity/test_qwen35_hf_parity.py (moved to tests/integration/models/test_qwen35.py)
- Remove unused 'torch.nn' import in attention.py (flake8 F401)
…eb-Edu 10BT

- Add training config (qwen35_0.8b_fineweb_edu_10bt.yaml): continued pretraining
  with lr=5e-5, cosine LR, FSDP v2, activation checkpointing, 76K steps
- Add test config (test_qwen35_0.8b.yaml): single-GPU smoke test, 100 steps
- Add SLURM script (run_qwen35_fineweb_edu.sh): 1 node x 8 GPU, 48h
- Fix GatedDeltaNet packed sequence support in decoder_layer.py:
  train recipe hardcodes sequence packing (packed=True), producing 2D tensors.
  GatedDeltaNet expects 3D (B, S, D). Added unsqueeze/squeeze to handle packed
  format. Non-packed paths (SFT, inference) unaffected; see the sketch below.
- Add docs (qwen35_continued_pretraining.md)
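
A sketch of the packed-sequence shim described above (illustrative; the real decoder_layer.py dispatch carries more context):

```python
import torch

def run_linear_attention(gdn, x: torch.Tensor) -> torch.Tensor:
    """Packed batches arrive as 2D (total_tokens, model_dim); GatedDeltaNet
    expects 3D (batch, seq, model_dim). Unsqueeze on the way in, squeeze out.
    """
    packed = x.dim() == 2
    if packed:
        x = x.unsqueeze(0)   # (1, total_tokens, model_dim)
    y = gdn(x)
    return y.squeeze(0) if packed else y
```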
