… Qwen35DecoderLayer
- gated_delta_net.py: GatedDeltaNet module with IncrementalState, RMSNormGated, and PyTorch fallback kernels (chunk/recurrent delta rule, causal conv1d)
- attention.py: Qwen35Attention with doubled Q projection, output gating, partial RoPE (64/256 dims), QK-norm, GQA
- decoder_layer.py: Qwen35DecoderLayer with hybrid dispatch (full_attention → Qwen35Attention, linear_attention → GatedDeltaNet)
- moe.py: Qwen35TopKRouter (softmax + top-k + renormalize), Qwen35Experts (3D-param fused experts with SiLU-gated MLP), Qwen35MoeBlock(FeedForwardNetwork), a drop-in FFN replacement with shared expert + sigmoid gate blending
- Router uses softmax (NOT sigmoid like Llama4)
- Expert weights stored as 3D Parameters for per-expert indexing
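The 3D-parameter expert layout can be sketched as follows — a minimal illustration of a fused SiLU-gated MLP over batched expert weights, not the actual Qwen35Experts code; the function name and all shapes here are assumptions:

```python
import torch
import torch.nn.functional as F

def fused_experts(x, w_gate, w_up, w_down):
    """SiLU-gated MLP applied per expert via batched matmul.

    x:      (num_experts, tokens_per_expert, model_dim) -- tokens grouped by expert
    w_gate: (num_experts, model_dim, inner_dim)  -- stored as a single 3D Parameter
    w_up:   (num_experts, model_dim, inner_dim)
    w_down: (num_experts, inner_dim, model_dim)
    """
    gate = torch.bmm(x, w_gate)                  # (E, T, inner_dim)
    up = torch.bmm(x, w_up)                      # (E, T, inner_dim)
    return torch.bmm(F.silu(gate) * up, w_down)  # (E, T, model_dim)

E, T, D, inner = 4, 3, 8, 16
out = fused_experts(
    torch.randn(E, T, D),
    torch.randn(E, D, inner),
    torch.randn(E, D, inner),
    torch.randn(E, inner, D),
)
```

Keeping each projection as one 3D Parameter is what allows per-expert indexing without splitting the weights into per-expert modules.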
- test_gated_delta_net.py (6 tests): forward shape, prefill parity, step-by-step decode, chunked vs recurrent consistency, state reorder, RMSNormGated correctness
- test_qwen35_attention.py (6 tests): forward shape, output gating, partial RoPE, GQA, QK-norm effect, incremental KV cache decode
- test_qwen35_interop.py (4 tests): state dict key round-trip, RMSNorm weight+1 conversion, GDN norm NOT converted, layer types
- test_qwen35_decoder_layer.py (5 tests): full/linear attention layers, invalid type error, factory e2e model creation, hybrid layer pattern
- test_qwen35_moe.py (8 tests): router shapes/softmax/renorm, expert shapes/weighted output, MoE block shape, shared expert, FeedForwardNetwork inheritance
- qwen35_architecture.md: Theory & HF reference analysis (hybrid layers, partial RoPE, (1+w) RMSNorm, GatedDeltaNet, MoE, dual cache)
- qwen35_implementation_plan.md: 5-phase implementation plan with code sketches
- qwen35_key_decisions.md: 6 design decisions with alternatives, tradeoffs, FSDP/TP analysis, and OLMo2 RMSNorm comparison
- qwen35_progress.md: Implementation tracker (Phases 1-4 complete)
- Fix SDPA call: pass q_layout/k_layout positional args, unpack tuple return
- Fix parent constructor: MultiheadAttention.__init__() takes no arguments
- Fix import: repeat_interleave from fairseq2.ops (not fairseq2.nn.functional)
- Fix incremental KV cache test: use CausalAttentionBias (not IdentityBias)
- Preserve raw logits in router_logits; apply softmax to a separate router_probs
- Top-k selection now uses router_probs (post-softmax), not raw logits
- The return tuple's first element is now the raw pre-softmax logits, for future load-balancing loss support
- Update test: verify logits do NOT sum to 1 and weights DO sum to 1
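The routing change above can be sketched as follows — a minimal stand-in for the router forward, not the actual Qwen35TopKRouter code; the function name and shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def route(hidden, gate_weight, top_k):
    """Softmax top-k routing: keep raw logits, select and weight on post-softmax probs."""
    router_logits = hidden @ gate_weight.t()         # raw pre-softmax logits (kept for aux loss)
    router_probs = F.softmax(router_logits, dim=-1)  # selection happens on probabilities
    topk_probs, topk_idx = router_probs.topk(top_k, dim=-1)
    topk_weights = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalize over top-k
    return router_logits, topk_weights, topk_idx

torch.manual_seed(0)
logits, weights, idx = route(torch.randn(4, 8), torch.randn(16, 8), top_k=2)
# weights sum to 1 per token; logits are left in raw pre-softmax space
```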
- Add qwen35_0.8b arch config matching HF Qwen3.5-0.8B exactly: model_dim=1024, num_layers=24, ffn_inner_dim=3584, rope_theta=10M, linear_num_key_heads=16, linear_num_value_heads=16, tied_embeddings=True
- Add asset cards for qwen35_0.8b, qwen35_27b, qwen35_moe_35b_a3b with HuggingFace checkpoint/tokenizer URIs
- Add TODO comments after Qwen 3.5 dense and MoE family registration
- Tracks that HF export support requires reverse RMSNorm conversion (weight -= 1.0)
Root cause of HF parity failure: Qwen3_5RMSNorm uses the (1+weight) formula for ALL norms, including q_norm and k_norm in attention layers. Only the layer norms (input_layernorm, post_attention_layernorm, model.norm) were being converted with weight += 1.0 — q_norm and k_norm were missed. Without conversion, q_norm computes norm(x) * 0.43 instead of norm(x) * 1.43, a 3.3x scaling error that compounds through all 6 full-attention layers.
- Add 'self_attn.q_norm.weight' and 'self_attn.k_norm.weight' to _QWEN35_RMSNORM_KEYS
- Fix interop test: reset layer_types=None before __post_init__ regeneration
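The import-time conversion amounts to the following sketch; the suffix set below is an assumption (the real list lives in _QWEN35_RMSNORM_KEYS), and the helper name is illustrative:

```python
import torch

# Suffixes of RMSNorm params stored in (1+weight) space in the HF checkpoint.
# The self_attn.q_norm / self_attn.k_norm entries are the ones originally missed.
_RMSNORM_SUFFIXES = (
    "input_layernorm.weight",
    "post_attention_layernorm.weight",
    "model.norm.weight",
    "self_attn.q_norm.weight",
    "self_attn.k_norm.weight",
)

def convert_rmsnorm_from_hf(state_dict):
    """Shift HF (1+weight)-space RMSNorm params into standard weight space on import."""
    return {
        key: value + 1.0 if key.endswith(_RMSNORM_SUFFIXES) else value
        for key, value in state_dict.items()
    }

sd = {
    "model.layers.0.self_attn.q_norm.weight": torch.full((4,), 0.43),
    "model.layers.0.self_attn.q_proj.weight": torch.zeros(4, 4),  # untouched
}
converted = convert_rmsnorm_from_hf(sd)
```

Export would apply the inverse shift (weight -= 1.0), matching the TODO tracked above.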
- Add end-to-end parity test: loads Qwen3.5-0.8B from HuggingFace, converts state dict, runs inference, asserts logit closeness (atol=1e-4)
- Add parity investigation writeup documenting the debugging methodology: layer-by-layer comparison, M-RoPE analysis (confirmed as no-op for text), sub-operation comparison isolating q_norm/k_norm as root cause
- Test result: max abs diff 7.63e-06, cosine sim 1.0, top-5 token match
…ress
Dense archs (from /checkpoint/smallomnillm/shared/models/):
- qwen35_0.8b: h=1024 L=24 H=8 kv=2 ffn=3584 tied=True lvh=16
- qwen35_2b: h=2048 L=24 H=8 kv=2 ffn=6144 tied=True lvh=16
- qwen35_9b: h=4096 L=32 H=16 kv=4 ffn=12288 tied=False lvh=32
- qwen35_27b: h=5120 L=64 H=24 kv=4 ffn=17408 tied=False lvh=48

MoE arch:
- qwen35_moe_35b_a3b: h=2048 L=40 H=16 kv=2 E=256 K=8

Asset cards: 8 entries (0.8b, 2b, 2b_base, 9b, 9b_base, 27b, moe_35b_a3b, moe_35b_a3b_base)
Progress doc: updated through Phase 5 with all configs and parity results
…path kernels, sharder
- Add _Qwen35HuggingFaceConverter and _Qwen35MoeHuggingFaceConverter (interop.py)
- Register HF converters in composition/models.py for qwen3_5 and qwen3_5_moe families
- Add get_qwen35_moe_model_hub and get_qwen35_moe_tokenizer_hub (hub.py)
- Add conditional imports for causal_conv1d and fla fast-path kernels (gated_delta_net.py)
- Add get_qwen35_shard_specs() for deprecated-style FSDP sharding (sharder.py)
- Export all new symbols from __init__.py
- Add 6 unit tests for HF converters (config mapping, state dict round-trip, RMSNorm reversal, tied embeddings)
- Fix MoE key map: mlp.gate -> ffn.gate (matching actual Qwen35MoeBlock attribute name)
- Update progress doc to mark Phase 6 complete

35/35 unit tests passing.
- Add SFT YAML config (qwen35_0.8b_gsm8k.yaml) with model, tokenizer, dataset, optimizer, and scheduler settings
- Add convenience run script (run_qwen35_gsm8k.sh)
- Add pad_idx field to QwenConfig and Qwen35Config for SFT compatibility
- Wire pad_idx through Qwen35Factory to the embedding and TransformerLM
- Set tokenizer use_im_end=true so eos_idx != pad_idx (required by VocabularyInfo)
…n failure
- Fix 10 mypy errors across 4 files:
  - gated_delta_net.py: remove unused import, fix None->Tensor assignment
  - moe.py: add explicit Tensor type annotation (no-any-return)
  - factory.py: narrow return types to Qwen35Attention/GatedDeltaNet
  - test_gated_delta_net.py: add assert-not-None guards
- Add mypy override for optional deps (causal_conv1d, fla) in pyproject.toml
- Move parity test to tests/integration/models/test_qwen35.py with pytest.mark.skipif guard for unsupported transformers versions
- Run isort + black formatting on all changed files
…ttention.py
- Remove tests/parity/test_qwen35_hf_parity.py (moved to tests/integration/models/test_qwen35.py)
- Remove unused 'torch.nn' import in attention.py (flake8 F401)
…eb-Edu 10BT
- Add training config (qwen35_0.8b_fineweb_edu_10bt.yaml): continued pretraining with lr=5e-5, cosine LR, FSDP v2, activation checkpointing, 76K steps
- Add test config (test_qwen35_0.8b.yaml): single-GPU smoke test, 100 steps
- Add SLURM script (run_qwen35_fineweb_edu.sh): 1 node x 8 GPU, 48h
- Fix GatedDeltaNet packed sequence support in decoder_layer.py: the train recipe hardcodes sequence packing (packed=True), producing 2D tensors, while GatedDeltaNet expects 3D (B, S, D). Added unsqueeze/squeeze to handle the packed format. Non-packed paths (SFT, inference) unaffected.
- Add docs (qwen35_continued_pretraining.md)
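The packed-sequence fix amounts to the following shape shim — a minimal sketch, with an identity op standing in for the real GatedDeltaNet forward; the function name is illustrative:

```python
import torch

def gdn_forward_packed_safe(seqs: torch.Tensor) -> torch.Tensor:
    """Accept both packed 2D (total_tokens, model_dim) and batched 3D (B, S, D) input."""
    packed = seqs.dim() == 2
    if packed:
        seqs = seqs.unsqueeze(0)   # (1, total_tokens, model_dim) for GatedDeltaNet
    out = seqs * 1.0               # stand-in for the actual GatedDeltaNet forward
    if packed:
        out = out.squeeze(0)       # restore the packed 2D shape for downstream layers
    return out

packed_out = gdn_forward_packed_safe(torch.randn(16, 64))    # packed train path
batched_out = gdn_forward_packed_safe(torch.randn(2, 16, 64))  # SFT / inference path
```

The non-packed 3D path passes through unchanged, which is why SFT and inference are unaffected.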
Adds full Qwen 3.5 model family support to fairseq2 — dense model variants (0.8B, 2B, 9B, 27B) — including model architecture, tokenizer integration, HuggingFace interoperation, and a validated SFT training recipe.
Architecture
1. Hybrid Attention
Qwen 3.5 is a hybrid Linear + Full Attention model that alternates two layer types:
Every 4th layer is full attention; the rest are linear attention.
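This pattern can be sketched as below; the exact phase of the full-attention layers (here the 4th, 8th, … layers) is an assumption, since the real pattern comes from the config's layer types:

```python
def make_layer_types(num_layers: int) -> list:
    """Every 4th layer is full attention; the rest are linear attention."""
    return [
        "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
        for i in range(num_layers)
    ]

types = make_layer_types(24)  # e.g. the 24-layer 0.8B config -> 6 full-attention layers
```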
2. (1+weight) RMSNorm
All RMSNorms use norm(x) * (1 + weight) with weights initialized to zeros, instead of the standard norm(x) * weight with weights initialized to ones. Mathematically equivalent at init, but different parameter space — this required special handling in the state dict converter (weight += 1.0 on import, weight -= 1.0 on export).
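A minimal sketch of the two parameterizations and the conversion between them (function names are illustrative, not the fairseq2 implementation):

```python
import torch

def rmsnorm_plus_one(x, weight, eps=1e-6):
    """Qwen 3.5 style: norm(x) * (1 + weight), weight initialized to zeros."""
    normed = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return normed * (1.0 + weight)

def rmsnorm_standard(x, weight, eps=1e-6):
    """Standard style: norm(x) * weight, weight initialized to ones."""
    normed = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return normed * weight

torch.manual_seed(0)
x = torch.randn(2, 8)
w_hf = torch.randn(8) * 0.1  # checkpoint-space weight (near zero after training)

# The two formulas agree once the checkpoint weight is shifted by +1.0 on import:
same = torch.allclose(rmsnorm_plus_one(x, w_hf), rmsnorm_standard(x, w_hf + 1.0))
```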
3. Partial Rotary Embeddings
Only 25% of head dimensions get rotary position encoding (64 of 256). The other 192 dims are position-independent, acting as retrieval-like features. Most other models rotate all dims.
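A sketch of the partial rotation, assuming the common rotate-half convention and a duplicated cos/sin layout; both are assumptions about the actual implementation:

```python
import torch

def apply_partial_rope(q, cos, sin, rotary_dim=64):
    """Rotate only the first rotary_dim dims; pass the remaining dims through.

    q:        (..., seq, head_dim) with head_dim=256
    cos, sin: (seq, rotary_dim)
    """
    q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]
    half = rotary_dim // 2
    x1, x2 = q_rot[..., :half], q_rot[..., half:]
    rotated = torch.cat((-x2, x1), dim=-1)  # rotate-half convention
    return torch.cat((q_rot * cos + rotated * sin, q_pass), dim=-1)

q = torch.randn(2, 5, 256)
pos = torch.arange(5, dtype=torch.float32)
inv_freq = 1.0 / (10_000_000.0 ** (torch.arange(0, 32) / 32.0))  # rope_theta=10M
angles = torch.outer(pos, inv_freq)                  # (5, 32)
cos = torch.cat((angles.cos(), angles.cos()), dim=-1)  # (5, 64)
sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
out = apply_partial_rope(q, cos, sin)
# out[..., 64:] is bit-identical to q[..., 64:] -- those dims carry no position signal
```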
MoE Support (configs and inference only)
moe.py — Qwen35TopKRouter (softmax → top-k → renormalize), Qwen35Experts (3D parameter experts), Qwen35MoeBlock (drop-in FFN replacement with shared expert)
Qwen35MoeConfig(Qwen35Config) + Qwen35MoeFactory(Qwen35Factory)
_Qwen35MoeHuggingFaceConverter and asset cards for 35B-A3B
Note: MoE variant configs are available for inference, but MoE training/SFT is not yet supported due to lack of MoE parallelization support in fairseq2.
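The shared-expert blending in Qwen35MoeBlock can be sketched as below; the exact blending formula is an assumption based on the description above, and the names are illustrative:

```python
import torch

def moe_block_output(routed_out, shared_out, shared_gate_logit):
    """Blend an always-on shared expert into the routed output via a sigmoid gate.

    routed_out:        (tokens, model_dim) weighted sum of top-k expert outputs
    shared_out:        (tokens, model_dim) shared-expert MLP output
    shared_gate_logit: (tokens, 1) per-token gate logit
    """
    return routed_out + torch.sigmoid(shared_gate_logit) * shared_out

routed = torch.ones(3, 4)
shared = torch.ones(3, 4) * 2.0
out = moe_block_output(routed, shared, torch.zeros(3, 1))  # sigmoid(0) = 0.5
```

Unlike the routed experts, the shared expert sees every token; the sigmoid gate only scales its contribution.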
SFT Training Recipe
qwen35_0.8b_gsm8k.yaml — SFT config on GSM8K with FSDP, bfloat16, and cosine annealing
Tests: 35 unit tests across GatedDeltaNet, attention, interop, decoder layer, MoE, and HF converters, plus an end-to-end HF parity test.
Fixes #1497