The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Zanlin Ni1 Shenzhi Wang1 Yang Yue1 Tianyu Yu2 Weilin Zhao2 Yeguo Hua3
Tianyi Chen3 Jun Song4 Cheng Yu4 Bo Zheng4 Gao Huang1✉
1LeapLab, Tsinghua University 2NLPLab, Tsinghua University 3Tsinghua University 4Alibaba Group
No combinatorial trajectories. No ELBO approximations. No diffusion-specific adaptations.
Just GRPO.
- [2026.01] 📄 Paper available on arXiv!
- [2026.01] 🎉 Training code, evaluation scripts, and model checkpoint on GSM8K released!
- Add support for MATH-500
- Add support for coding tasks (HumanEval, MBPP)
Diffusion LLMs (dLLMs) can generate tokens in arbitrary order, which in theory offers more flexibility than standard left-to-right generation. But does this flexibility actually unlock unique reasoning capabilities that are inaccessible to standard autoregressive (AR) models?
We found the opposite. Arbitrary-order generation allows models to bypass high-uncertainty tokens (e.g., "Therefore", "Since") — the very tokens that create branching points in reasoning. This premature bypass collapses the solution space, leading to lower reasoning potential (Pass@k).
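To make this concrete, here is a toy sketch (our illustration, not code from this repository) contrasting a confidence-based unmasking order with plain left-to-right order; the logits and the "uncertain" position are synthetic.

```python
# Toy illustration (not repo code) of why arbitrary order can skip branching points.
# With confidence-based unmasking, the sampler fills in the positions it is most
# sure about first, so a high-uncertainty connective ("Therefore", "Since") is only
# decoded after its surrounding context is already fixed; left-to-right order is
# forced to commit to it up front.
import torch

vocab_size, seq_len = 100, 6
torch.manual_seed(0)
logits = torch.randn(seq_len, vocab_size)     # stand-in for the denoiser's output
logits[2] *= 0.1                              # position 2 ~ an uncertain branching token
probs = logits.softmax(dim=-1)
confidence = probs.max(dim=-1).values         # per-position max probability

# Arbitrary (confidence-based) order: most confident positions are unmasked first,
# so the uncertain position 2 is deferred until the rest of the context is filled.
arbitrary_order = confidence.argsort(descending=True).tolist()

# Left-to-right (AR) order: position 2 must be sampled while it is still uncertain,
# which keeps the branch open across rollouts.
ar_order = list(range(seq_len))

print("confidence:", [round(c, 3) for c in confidence.tolist()])
print("arbitrary order:", arbitrary_order)    # position 2 typically appears last
print("AR order:       ", ar_order)
```

The point is not the specific sampler but the effect: across many rollouts, deferring the branching token pushes them toward similar solutions, which is what lowers Pass@k.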
Our solution is simple: since AR order preserves reasoning potential better, we just train dLLMs with standard GRPO in AR mode. No bells and whistles.
JustGRPO achieves state-of-the-art performance across reasoning and coding benchmarks.
Existing RL methods for dLLMs often require handling the complexity of arbitrary-order generation:
| Challenge | Description |
|---|---|
| Combinatorial trajectories | Optimizing over factorial-sized denoising paths |
| Intractable likelihoods | ELBO-based surrogates instead of true objectives |
| Sampler-learner mismatch | Confidence-based samplers vs. original diffusion prior |
- JustGRPO sidesteps all of this by treating dLLMs as autoregressive models during RL training. The result? Standard GRPO, directly applicable, with exact likelihood computation.
- The core logic of JustGRPO (`grpo.py`) fits in ~60 lines: rollout sampling and log-probability loss computation. That's it (see the sketch below).
💡 The model still retains parallel decoding at inference time — we only use AR order during training. See our paper for more details.
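For intuition, here is a simplified sketch of that logic, not the actual `grpo.py`: sample a group of AR rollouts, turn rewards into group-relative advantages, and weight exact sequence log-probabilities by those advantages. The `"gpt2"` model name and `reward_fn` are placeholders, and PPO-style clipping, KL regularization, and padding handling are omitted for brevity.

```python
# Minimal GRPO-style sketch (illustrative only, not the repo's grpo.py).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; JustGRPO trains a dLLM used as an AR policy
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def grpo_loss(prompt: str, reward_fn, group_size: int = 4, max_new_tokens: int = 64):
    enc = tok(prompt, return_tensors="pt")
    prompt_len = enc.input_ids.shape[1]

    # 1) Rollouts: sample a group of completions in plain left-to-right order.
    with torch.no_grad():
        seqs = model.generate(
            **enc,
            do_sample=True,
            max_new_tokens=max_new_tokens,
            num_return_sequences=group_size,
            pad_token_id=tok.eos_token_id,
        )

    # 2) Group-relative advantages: normalize rewards within the group.
    rewards = torch.tensor(
        [reward_fn(tok.decode(s[prompt_len:])) for s in seqs], dtype=torch.float
    )
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # 3) Exact log-likelihood of each completion via one teacher-forced pass.
    logits = model(seqs).logits[:, :-1]                  # predict token t+1 from prefix
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, seqs[:, 1:].unsqueeze(-1)).squeeze(-1)
    completion_logp = token_logp[:, prompt_len - 1:].sum(dim=-1)

    # 4) Policy-gradient-style loss: push up completions with positive advantage.
    return -(adv * completion_logp).mean()
```

Because the rollouts are ordinary left-to-right samples, step 3 is an exact sequence log-likelihood, which is what removes the need for ELBO surrogates or trajectory-level approximations.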
JustGRPO is designed to be lightweight and dependency-minimal.
```bash
git clone https://github.com/LeapLabTHU/JustGRPO.git
cd JustGRPO
pip install -r requirements.txt
```

Dependencies:
- `accelerate`
- `transformers`
- `datasets`
- Standard evaluation utilities (`sympy`, `latex2sympy2`, etc.)
We provide training and evaluation code for GSM8K. The RL-trained model is available on Hugging Face.
```bash
accelerate launch --num_processes 8 --main_process_ip localhost --config_file configs/fsdp.yaml train.py \
    --run_dir ./checkpoints \
    --grad_accum 8
```

Note: Keep the global batch size = `num_gpus` × `grad_accum` = 64. Adjust `--grad_accum` based on your GPU count (e.g., 16 GPUs → `--grad_accum 4`), as in the small helper below.
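A tiny illustration of that arithmetic (a hypothetical helper, not part of the repo):

```python
# Hypothetical helper (not part of this repo): pick --grad_accum so that
# num_gpus * grad_accum stays equal to the global batch size of 64.
def grad_accum_for(num_gpus: int, global_batch_size: int = 64) -> int:
    assert global_batch_size % num_gpus == 0, "use a GPU count that divides 64"
    return global_batch_size // num_gpus

print(grad_accum_for(8))   # 8, matching the command above
print(grad_accum_for(16))  # 4, matching the 16-GPU example
```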
```bash
torchrun --standalone --nproc-per-node=8 eval.py \
    --ckpt_path /path/to/ckpt \
    --steps 256 \
    --gen_length 256 \
    --block_length 32
```

If you find this work useful, please cite:
```bibtex
@article{ni2026flexibility,
  title={The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models},
  author={Ni, Zanlin and Wang, Shenzhi and Yue, Yang and Yu, Tianyu and Zhao, Weilin and Hua, Yeguo and Chen, Tianyi and Song, Jun and Yu, Cheng and Zheng, Bo and Huang, Gao},
  journal={arXiv preprint arXiv:2601.15165},
  year={2026}
}
```

This project builds upon the following excellent works:
We thank the authors for their open-source contributions to the community.

