
Add Qwen3.5 models #1500

Open
YunchaoYang wants to merge 25 commits into main from yy/qwen35

Conversation


YunchaoYang (Contributor) commented on Mar 25, 2026

Adds full Qwen 3.5 model family support to fairseq2: dense variants (0.8B, 2B, 9B, 27B), including model architecture, tokenizer integration, HuggingFace interoperation, and a validated SFT training recipe.

Architecture

1. Hybrid Attention

Qwen 3.5 is a hybrid Linear + Full Attention model that alternates two layer types:

  • Gated Linear Attention (75%): Gated DeltaNet — causal convolution + gated delta rule recurrence. Constant memory during decoding (no KV cache growth).
  • Gated Full Attention (25%): Standard multi-head attention with output gating, partial rotary embeddings (64/256 dims), and QK-norm.

Every 4th layer is full attention; the rest are linear attention.
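
To make the hybrid layout and the constant-memory claim concrete, here is a minimal sketch in plain PyTorch, not fairseq2's actual code. It assumes the common gated delta rule form S ← α·(I − β·k·kᵀ)·S + β·k·vᵀ with output o = Sᵀq; the exact parameterization in this PR's GatedDeltaNet may differ.

```python
import torch

# Hybrid layout: every 4th layer is full attention, the rest linear.
def layer_types(num_layers: int) -> list[str]:
    return [
        "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
        for i in range(num_layers)
    ]

def gated_delta_step(S, q, k, v, alpha, beta):
    """One decode step of a gated delta rule (single head, illustrative).

    S: (d_k, d_v) recurrent state; q, k: (d_k,); v: (d_v,);
    alpha, beta: scalars in (0, 1). The state shape is fixed, so decoding
    uses constant memory instead of a growing KV cache.
    """
    # S <- alpha * (I - beta * k k^T) S + beta * k v^T
    S = alpha * (S - beta * torch.outer(k, k @ S)) + beta * torch.outer(k, v)
    return S, S.T @ q  # new state, attention output of shape (d_v,)
```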

2. (1+weight) RMSNorm

All RMSNorms use norm(x) * (1 + weight) with weights initialized to zeros, instead of the standard norm(x) * weight with weights initialized to ones. Mathematically equivalent at init, but different parameter space — this required special handling in the state dict converter (weight += 1.0 on import, weight -= 1.0 on export).
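
A compact sketch of the two parameterizations and the weight shift (illustrative; the real converter operates on state dict keys):

```python
import torch

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

def qwen35_norm(x, w):    # Qwen 3.5 / HF: scale is (1 + w), w init to zeros
    return rms_norm(x) * (1.0 + w)

def standard_norm(x, w):  # standard: scale is w, w init to ones
    return rms_norm(x) * w

# The converter's weight shift makes the two numerically identical:
w_hf = torch.randn(8)
x = torch.randn(2, 8)
assert torch.allclose(qwen35_norm(x, w_hf), standard_norm(x, w_hf + 1.0))
# import (HF -> fairseq2): weight += 1.0; export: weight -= 1.0
```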

3. Partial Rotary Embeddings

Only 25% of head dimensions get rotary position encoding (64 of 256). The other 192 dims are position-independent, acting as retrieval-like features. Most other models rotate all dims.
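
A sketch of the partial rotation, assuming the half-split rotation convention (the interleaving in the actual implementation may differ):

```python
import torch

def apply_partial_rope(q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rotary_dim: int = 64) -> torch.Tensor:
    """Rotate only the first `rotary_dim` dims of each 256-dim head.

    q: (..., head_dim); cos, sin: (..., rotary_dim // 2).
    """
    q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]
    x1, x2 = q_rot.chunk(2, dim=-1)
    q_rot = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    # The remaining 192 dims pass through position-independent.
    return torch.cat((q_rot, q_pass), dim=-1)
```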

MoE Support (configs and inference only)

  • moe.py: Qwen35TopKRouter (softmax → top-k → renormalize; sketched below), Qwen35Experts (3D-parameter experts), Qwen35MoeBlock (drop-in FFN replacement with a shared expert)
  • Qwen35MoeConfig(Qwen35Config) + Qwen35MoeFactory(Qwen35Factory)
  • _Qwen35MoeHuggingFaceConverter and asset cards for 35B-A3B

Note: MoE variant configs are available for inference, but MoE training/SFT is not yet supported because fairseq2 lacks MoE parallelization support.
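
A minimal sketch of the routing math named above, folding in the later commit that keeps raw logits separate from the renormalized weights; illustrative, not the actual Qwen35TopKRouter:

```python
import torch

def route(router_logits: torch.Tensor, top_k: int):
    """router_logits: (num_tokens, num_experts)."""
    probs = router_logits.softmax(dim=-1)                  # softmax, not sigmoid (unlike Llama4)
    weights, expert_ids = probs.topk(top_k, dim=-1)        # top-k on post-softmax probs
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize: weights sum to 1
    return router_logits, weights, expert_ids              # raw logits first, for a future load-balancing loss
```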

SFT Training Recipe

qwen35_0.8b_gsm8k.yaml — SFT config on GSM8K with FSDP, bfloat16, cosine annealing

Tests:

  1. 29 component unit tests passed;
  2. Parity test passed (logit diff < 1e-4): validated inference against HuggingFace Transformers on the Qwen/Qwen3.5-0.8B model;
  3. Dense SFT recipe validated end-to-end on an 8-GPU SLURM node

Fixes #1497

Does your PR introduce any breaking changes? If yes, please list them:

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

YunchaoYang marked this pull request as ready for review on March 31, 2026.
… Qwen35DecoderLayer

- gated_delta_net.py: GatedDeltaNet module with IncrementalState, RMSNormGated,
  and PyTorch fallback kernels (chunk/recurrent delta rule, causal conv1d)
- attention.py: Qwen35Attention with doubled Q projection, output gating,
  partial RoPE (64/256 dims), QK-norm, GQA
- decoder_layer.py: Qwen35DecoderLayer with hybrid dispatch
  (full_attention → Qwen35Attention, linear_attention → GatedDeltaNet)
- moe.py: Qwen35TopKRouter (softmax + top-k + renormalize),
  Qwen35Experts (3D param fused experts with SiLU-gated MLP),
  Qwen35MoeBlock(FeedForwardNetwork) drop-in FFN replacement
  with shared expert + sigmoid gate blending
- Router uses softmax (NOT sigmoid like Llama4)
- Expert weights stored as 3D Parameters for per-expert indexing
- test_gated_delta_net.py (6 tests): forward shape, prefill parity,
  step-by-step decode, chunked vs recurrent consistency, state reorder,
  RMSNormGated correctness
- test_qwen35_attention.py (6 tests): forward shape, output gating,
  partial RoPE, GQA, QK-norm effect, incremental KV cache decode
- test_qwen35_interop.py (4 tests): state dict key round-trip,
  RMSNorm weight+1 conversion, GDN norm NOT converted, layer types
- test_qwen35_decoder_layer.py (5 tests): full/linear attention layers,
  invalid type error, factory e2e model creation, hybrid layer pattern
- test_qwen35_moe.py (8 tests): router shapes/softmax/renorm,
  expert shapes/weighted output, MoE block shape, shared expert,
  FeedForwardNetwork inheritance
- qwen35_architecture.md: Theory & HF reference analysis (hybrid layers,
  partial RoPE, (1+w) RMSNorm, GatedDeltaNet, MoE, dual cache)
- qwen35_implementation_plan.md: 5-phase implementation plan with code sketches
- qwen35_key_decisions.md: 6 design decisions with alternatives, tradeoffs,
  FSDP/TP analysis, and OLMo2 RMSNorm comparison
- qwen35_progress.md: Implementation tracker (Phases 1-4 complete)
- Fix SDPA call: pass q_layout/k_layout positional args, unpack tuple return
- Fix parent constructor: MultiheadAttention.__init__() takes no arguments
- Fix import: repeat_interleave from fairseq2.ops (not fairseq2.nn.functional)
- Fix incremental KV cache test: use CausalAttentionBias (not IdentityBias)
- Preserve raw logits in router_logits, apply softmax to separate router_probs
- Top-k selection now uses router_probs (post-softmax), not raw logits
- Return tuple's first element is now raw pre-softmax logits for future
  load-balancing loss support
- Update test: verify logits do NOT sum to 1 and weights DO sum to 1
- Add qwen35_0.8b arch config matching HF Qwen3.5-0.8B exactly (see the dataclass sketch below):
  model_dim=1024, num_layers=24, ffn_inner_dim=3584, rope_theta=10M,
  linear_num_key_heads=16, linear_num_value_heads=16, tied_embeddings=True
- Add asset cards for qwen35_0.8b, qwen35_27b, qwen35_moe_35b_a3b
  with HuggingFace checkpoint/tokenizer URIs
- Add TODO comments after Qwen 3.5 dense and MoE family registration
- Tracks that HF export support requires reverse RMSNorm conversion (weight -= 1.0)
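
The qwen35_0.8b values above, restated as a plain dataclass for readability; field names are copied from the commit message and are not verified against fairseq2's actual Qwen35Config:

```python
from dataclasses import dataclass

@dataclass
class Qwen35_08BArch:
    model_dim: int = 1024
    num_layers: int = 24
    ffn_inner_dim: int = 3584
    rope_theta: float = 10_000_000.0
    linear_num_key_heads: int = 16
    linear_num_value_heads: int = 16
    tied_embeddings: bool = True
```
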
Root cause of HF parity failure: Qwen3_5RMSNorm uses (1+weight) formula for
ALL norms, including q_norm and k_norm in attention layers. Only the layer
norms (input_layernorm, post_attention_layernorm, model.norm) were being
converted with weight += 1.0 — q_norm and k_norm were missed.

Without conversion, q_norm computes norm(x) * 0.43 instead of norm(x) * 1.43,
a 3.3x scaling error that compounds through all 6 full-attention layers.

- Add 'self_attn.q_norm.weight' and 'self_attn.k_norm.weight' to
  _QWEN35_RMSNORM_KEYS
- Fix interop test: reset layer_types=None before __post_init__ regeneration
- Add end-to-end parity test: loads Qwen3.5-0.8B from HuggingFace, converts
  state dict, runs inference, asserts logit closeness (atol=1e-4)
- Add parity investigation writeup documenting the debugging methodology:
  layer-by-layer comparison, M-RoPE analysis (confirmed as no-op for text),
  sub-operation comparison isolating q_norm/k_norm as root cause
- Test result: max abs diff 7.63e-06, cosine sim 1.0, top-5 token match
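
A hedged sketch of the parity check described above; `candidate_model` stands in for the converted fairseq2 model, and the tolerance matches the atol=1e-4 used in the test:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def check_parity(candidate_model) -> None:
    """Compare candidate logits against the HF reference (needs a
    transformers version that supports Qwen3.5, per the skipif guard).

    `candidate_model`: any callable mapping token ids (1, S) -> logits (1, S, V).
    """
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-0.8B")
    ref_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B")
    ids = tok("The gated delta rule", return_tensors="pt").input_ids
    with torch.no_grad():
        ref = ref_model(ids).logits
        out = candidate_model(ids)
    torch.testing.assert_close(out, ref, atol=1e-4, rtol=0)
```
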
…ress

Dense archs (from /checkpoint/smallomnillm/shared/models/):
  - qwen35_0.8b: h=1024 L=24 H=8 kv=2 ffn=3584 tied=True lvh=16
  - qwen35_2b:   h=2048 L=24 H=8 kv=2 ffn=6144 tied=True lvh=16
  - qwen35_9b:   h=4096 L=32 H=16 kv=4 ffn=12288 tied=False lvh=32
  - qwen35_27b:  h=5120 L=64 H=24 kv=4 ffn=17408 tied=False lvh=48
MoE arch:
  - qwen35_moe_35b_a3b: h=2048 L=40 H=16 kv=2 E=256 K=8

Asset cards: 8 entries (0.8b, 2b, 2b_base, 9b, 9b_base, 27b, moe_35b_a3b, moe_35b_a3b_base)
Progress doc: updated through Phase 5 with all configs and parity results
…path kernels, sharder

- Add _Qwen35HuggingFaceConverter and _Qwen35MoeHuggingFaceConverter (interop.py)
- Register HF converters in composition/models.py for qwen3_5 and qwen3_5_moe families
- Add get_qwen35_moe_model_hub and get_qwen35_moe_tokenizer_hub (hub.py)
- Add conditional imports for causal_conv1d and fla fast-path kernels in gated_delta_net.py (see the sketch below)
- Add get_qwen35_shard_specs() for deprecated-style FSDP sharding (sharder.py)
- Export all new symbols from __init__.py
- Add 6 unit tests for HF converters (config mapping, state dict round-trip, RMSNorm reversal, tied embeddings)
- Fix MoE key map: mlp.gate -> ffn.gate (matching actual Qwen35MoeBlock attribute name)
- Update progress doc to mark Phase 6 complete

35/35 unit tests passing.
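
A common shape for the conditional fast-path import described above (a sketch; the real gated_delta_net.py may structure this differently). `causal_conv1d_fn` is the kernel from the optional causal-conv1d package; the PyTorch fallback here is illustrative:

```python
import torch
import torch.nn.functional as F

try:
    from causal_conv1d import causal_conv1d_fn  # optional fused CUDA kernel
except ImportError:
    causal_conv1d_fn = None

def _torch_causal_conv1d(x, weight, bias=None):
    # Fallback: depthwise causal conv. x: (B, D, S); weight: (D, W).
    d, w = weight.shape
    x = F.pad(x, (w - 1, 0))  # left-pad so output at t sees only steps <= t
    return F.conv1d(x, weight.unsqueeze(1), bias, groups=d)

def causal_conv(x, weight, bias=None):
    if causal_conv1d_fn is not None and x.is_cuda:
        return causal_conv1d_fn(x, weight, bias)  # fast path
    return _torch_causal_conv1d(x, weight, bias)  # PyTorch fallback
```
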
- Add SFT YAML config (qwen35_0.8b_gsm8k.yaml) with model, tokenizer,
  dataset, optimizer, and scheduler settings
- Add convenience run script (run_qwen35_gsm8k.sh)
- Add pad_idx field to QwenConfig and Qwen35Config for SFT compatibility
- Wire pad_idx through Qwen35Factory to embedding and TransformerLM
- Set tokenizer use_im_end=true so eos_idx != pad_idx (required by
  VocabularyInfo)
…n failure

- Fix 10 mypy errors across 4 files:
  - gated_delta_net.py: remove unused import, fix None->Tensor assignment
  - moe.py: add explicit Tensor type annotation (no-any-return)
  - factory.py: narrow return types to Qwen35Attention/GatedDeltaNet
  - test_gated_delta_net.py: add assert-not-None guards
- Add mypy override for optional deps (causal_conv1d, fla) in pyproject.toml
- Move parity test to tests/integration/models/test_qwen35.py with
  pytest.mark.skipif guard for unsupported transformers versions
- Run isort + black formatting on all changed files
…ttention.py

- Remove tests/parity/test_qwen35_hf_parity.py (moved to tests/integration/models/test_qwen35.py)
- Remove unused 'torch.nn' import in attention.py (flake8 F401)
…eb-Edu 10BT

- Add training config (qwen35_0.8b_fineweb_edu_10bt.yaml): continued pretraining
  with lr=5e-5, cosine LR, FSDP v2, activation checkpointing, 76K steps
- Add test config (test_qwen35_0.8b.yaml): single-GPU smoke test, 100 steps
- Add SLURM script (run_qwen35_fineweb_edu.sh): 1 node x 8 GPU, 48h
- Fix GatedDeltaNet packed sequence support in decoder_layer.py:
  train recipe hardcodes sequence packing (packed=True), producing 2D tensors.
  GatedDeltaNet expects 3D (B, S, D). Added unsqueeze/squeeze to handle packed
  format. Non-packed paths (SFT, inference) unaffected; see the sketch below.
- Add docs (qwen35_continued_pretraining.md)
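
A sketch of the packed-sequence shim described above (illustrative; the real decoder_layer.py dispatch carries more context):

```python
import torch

def run_linear_attention(gdn, x: torch.Tensor) -> torch.Tensor:
    """Packed batches arrive as 2D (total_tokens, model_dim); GatedDeltaNet
    expects 3D (batch, seq, model_dim). Unsqueeze on the way in, squeeze out.
    """
    packed = x.dim() == 2
    if packed:
        x = x.unsqueeze(0)   # (1, total_tokens, model_dim)
    y = gdn(x)
    return y.squeeze(0) if packed else y
```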
