We are training a standalone DFlash drafter for a Qwen3.5 9B target model, but after ~240 steps, layers.0.input_layernorm.weight became NaN, and then the DFlash hidden states became NaN. Could you help us figure out what could have led to this? Thanks!
DFlash config:
- target model: Qwen3.5 9B
- DFlash hidden size: 4096
- DFlash layers: 5
- target context layers: [1, 8, 15, 22, 29]
- block size: 16
- blocks per sequence: 512
- mask token id: 248063
- loss gamma: 7
- DFlash gradient checkpointing: enabled
- attention backend: flash_attention_2
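One thing worth ruling out from the config above: if the mask token id (248063) is at or beyond the embedding table size, lookups can index garbage rows (or silently misbehave), which is a classic source of NaNs early in training. A minimal sanity-check sketch (the helper name and usage are hypothetical; vocab_size would come from your model config):

```python
def check_special_ids(vocab_size: int, *ids: int) -> list[int]:
    """Return any token ids that fall outside the valid range [0, vocab_size)."""
    return [i for i in ids if not (0 <= i < vocab_size)]


# Hypothetical usage: pass the drafter's embedding vocab size and the mask id.
# An empty list means the id is in range; anything returned is out of bounds.
bad_ids = check_special_ids(151936, 248063)
```

If the drafter shares the target tokenizer this is probably fine, but it is a cheap check before digging into numerics.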
Training setup:
- 2 nodes x 8 A100
- bf16
- DeepSpeed ZeRO
- micro batch size per GPU: 64
- gradient accumulation: 1
- lr: 6e-4
- lr_min: 1e-6
- warmup steps: 2000
- scheduler: cosine-like decay
- gradient clipping: 1.0
- optimizer: AdamW, betas=(0.9, 0.98)
- max length: 8192
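To narrow down where the NaN first appears, we have been considering a per-step scan over parameters and gradients (a minimal sketch, assuming a standard PyTorch module; the helper name is ours, not part of DFlash or DeepSpeed):

```python
import torch
import torch.nn as nn


def find_nonfinite_tensors(model: nn.Module) -> list[str]:
    """Return names of parameters (and their grads) containing NaN or Inf."""
    bad = []
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            bad.append(f"param:{name}")
        if p.grad is not None and not torch.isfinite(p.grad).all():
            bad.append(f"grad:{name}")
    return bad
```

Called right after backward (before the optimizer step), it can distinguish the two usual cases: grad entries with no param entries point to an overflow in the backward pass (e.g. bf16 attention logits), while NaN params with clean grads suggest the optimizer update or a weight-decay/clipping interaction. Does that distinction match anything known about DFlash at this lr?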