JustGRPO

The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

Zanlin Ni¹, Shenzhi Wang¹, Yang Yue¹, Tianyu Yu², Weilin Zhao², Yeguo Hua³,
Tianyi Chen³, Jun Song⁴, Cheng Yu⁴, Bo Zheng⁴, Gao Huang¹✉

¹LeapLab, Tsinghua University   ²NLPLab, Tsinghua University   ³Tsinghua University   ⁴Alibaba Group


No combinatorial trajectories. No ELBO approximations. No diffusion-specific adaptations.

Just GRPO.

📢 News

  • [2026.01] 📄 Paper available on arXiv!
  • [2026.01] 🎉 Training code, evaluation scripts, and model checkpoint on GSM8K released!

📋 TODO

  • Add support for MATH-500
  • Add support for coding tasks (HumanEval, MBPP)

Why JustGRPO?

Diffusion LLMs (dLLMs) can generate tokens in arbitrary order, which in theory offers more flexibility than standard left-to-right generation. But does this flexibility actually unlock unique reasoning capabilities that are inaccessible to standard AR models?

[Figure: Mechanism to Pass@k]

We found the opposite. Arbitrary-order generation allows models to bypass high-uncertainty tokens (e.g., "Therefore", "Since") — the very tokens that create branching points in reasoning. This premature bypass collapses the solution space, leading to lower reasoning potential (Pass@k).
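To make the mechanism concrete, here is a toy sketch (illustrative only, not the repo's decoder; the token strings and confidence values below are made up): confidence-based decoding defers the low-confidence connective, while left-to-right decoding must commit to it before continuing.

# Hypothetical confidences for the still-masked positions in a partial answer.
tokens     = ["x", "=", "4", ".", "[MASK]", "[MASK]", "[MASK]", "[MASK]"]
confidence = [1.0, 1.0, 1.0, 1.0, 0.31, 0.94, 0.90, 0.88]   # made-up values
masked     = [i for i, t in enumerate(tokens) if t == "[MASK]"]

# Arbitrary-order (confidence-based) decoding: fill the most confident masked
# position first, so the uncertain connective at position 4 (e.g. "Therefore")
# is deferred -- the branching point is bypassed and the solution space narrows.
conf_order = sorted(masked, key=lambda i: -confidence[i])
print("confidence-based fill order:", conf_order)   # [5, 6, 7, 4]

# AR-mode decoding: always fill the leftmost masked position, forcing the model
# to commit at the high-uncertainty token first.
ar_order = sorted(masked)
print("autoregressive fill order:  ", ar_order)      # [4, 5, 6, 7]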

Our solution is simple: since AR order better preserves reasoning potential, we just train dLLMs with standard GRPO in AR mode. No bells and whistles.

Results

JustGRPO achieves state-of-the-art performance across reasoning and coding benchmarks:

[Figure: Accuracy Comparison]

Simplicity

Existing RL methods for dLLMs often require handling the complexity of arbitrary-order generation:

  • Combinatorial trajectories: optimizing over factorial-sized denoising paths
  • Intractable likelihoods: ELBO-based surrogates instead of the true objective
  • Sampler-learner mismatch: confidence-based samplers vs. the original diffusion prior
  • JustGRPO sidesteps all of this by treating dLLMs as autoregressive models during RL training. The result? Standard GRPO, directly applicable, with exact likelihood computation.
  • The core logic of JustGRPO (grpo.py) fits in ~60 lines: rollout sampling and log-probability loss computation. That's it.

💡 The model still retains parallel decoding at inference time — we only use AR order during training. See our paper for more details.
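For intuition, here is a minimal sketch of a GRPO objective of this shape (a sketch only, assuming group-normalized rewards and a PPO-style clipped ratio; the grpo_loss name, tensor layout, and the 1e-4 normalizer are illustrative assumptions, not grpo.py itself):

import torch

def grpo_loss(logprobs, old_logprobs, rewards, group_size, clip_eps=0.2):
    # logprobs, old_logprobs: (B, T) per-token log-probs of the sampled
    # completion tokens under the current and the rollout policy, where
    # B = num_prompts * group_size completions.
    # rewards: (B,) scalar reward per completion (e.g. answer correctness).

    # Group-relative advantage: normalize each completion's reward against
    # the other completions sampled for the same prompt.
    r = rewards.view(-1, group_size)
    adv = (r - r.mean(dim=1, keepdim=True)) / (r.std(dim=1, keepdim=True) + 1e-4)
    adv = adv.view(-1, 1)                            # broadcast over tokens

    # PPO-style clipped surrogate, computed with exact AR likelihoods.
    ratio = torch.exp(logprobs - old_logprobs)       # (B, T)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.minimum(unclipped, clipped).mean()

Padding masks, any KL term, and the rollout loop are omitted here; see grpo.py for the actual implementation.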

Installation

JustGRPO is designed to be lightweight and dependency-minimal.

git clone https://github.com/LeapLabTHU/JustGRPO.git
cd JustGRPO
pip install -r requirements.txt

Dependencies:

  • accelerate
  • transformers
  • datasets
  • Standard evaluation utilities (sympy, latex2sympy2, etc.)

Usage

We provide training and evaluation code on GSM8K. The RL-trained model is available on Hugging Face.

Training

accelerate launch --num_processes 8 --main_process_ip localhost --config_file configs/fsdp.yaml train.py \
  --run_dir ./checkpoints \
  --grad_accum 8

Note: Keep global batch size = num_gpus × grad_accum = 64.

Adjust --grad_accum based on your GPU count (e.g., 16 GPUs → --grad_accum 4).

Evaluation

torchrun --standalone --nproc-per-node=8 eval.py \
  --ckpt_path /path/to/ckpt \
  --steps 256 \
  --gen_length 256 \
  --block_length 32
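As a rough guide to these flags (an assumption based on typical LLaDA-style semi-autoregressive block decoding; check eval.py for the exact semantics), the generation length is split into blocks and the denoising steps are divided evenly across them:

steps, gen_length, block_length = 256, 256, 32     # values from the command above

num_blocks      = gen_length // block_length       # 8 blocks, decoded left to right
steps_per_block = steps // num_blocks              # 32 denoising steps per block
tokens_per_step = block_length // steps_per_block  # ~1 token unmasked per step
print(num_blocks, steps_per_block, tokens_per_step)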

Citation

If you find this work useful, please cite:

@article{ni2026flexibility,
  title={The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models},
  author={Ni, Zanlin and Wang, Shenzhi and Yue, Yang and Yu, Tianyu and Zhao, Weilin and Hua, Yeguo and Chen, Tianyi and Song, Jun and Yu, Cheng and Zheng, Bo and Huang, Gao},
  journal={arXiv preprint arXiv:2601.15165},
  year={2026}
}

Acknowledgments

This project builds upon the following excellent works:

We thank the authors for their open-source contributions to the community.
