JustGRPO

The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

Zanlin Ni¹, Shenzhi Wang¹, Yang Yue¹, Tianyu Yu², Weilin Zhao², Yeguo Hua³,
Tianyi Chen³, Jun Song⁴, Cheng Yu⁴, Bo Zheng⁴, Gao Huang¹✉

¹LeapLab, Tsinghua University   ²NLPLab, Tsinghua University   ³Tsinghua University   ⁴Alibaba Group


No combinatorial trajectories. No ELBO approximations. No diffusion-specific adaptations.

Just GRPO.

📢 News

  • [2026.01] 📄 Paper available on arXiv!
  • [2026.01] 🎉 Training code, evaluation scripts, and model checkpoint on GSM8K released!

📋 TODO

  • Add support for MATH-500
  • Add support for coding tasks (HumanEval, MBPP)

Why JustGRPO?

Diffusion LLMs (dLLMs) can generate tokens in arbitrary order, which in theory offers more flexibility than standard left-to-right generation. But does this flexibility actually unlock unique reasoning capabilities that are inaccessible to standard AR models?

[Figure: Mechanism to Pass@k]

We found the opposite. Arbitrary-order generation allows models to bypass high-uncertainty tokens (e.g., "Therefore", "Since") — the very tokens that create branching points in reasoning. This premature bypass collapses the solution space, leading to lower reasoning potential (Pass@k).
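To make the mechanism concrete, here is a toy sketch (illustrative only, not the repo's decoder; the token strings and confidence values below are made up): confidence-based decoding defers the low-confidence connective, while left-to-right decoding must commit to it before continuing.

# Hypothetical confidences for the still-masked positions in a partial answer.
tokens     = ["x", "=", "4", ".", "[MASK]", "[MASK]", "[MASK]", "[MASK]"]
confidence = [1.0, 1.0, 1.0, 1.0, 0.31, 0.94, 0.90, 0.88]   # made-up values
masked     = [i for i, t in enumerate(tokens) if t == "[MASK]"]

# Arbitrary-order (confidence-based) decoding: fill the most confident masked
# position first, so the uncertain connective at position 4 (e.g. "Therefore")
# is deferred -- the branching point is bypassed and the solution space narrows.
conf_order = sorted(masked, key=lambda i: -confidence[i])
print("confidence-based fill order:", conf_order)   # [5, 6, 7, 4]

# AR-mode decoding: always fill the leftmost masked position, forcing the model
# to commit at the high-uncertainty token first.
ar_order = sorted(masked)
print("autoregressive fill order:  ", ar_order)      # [4, 5, 6, 7]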

Our solution is simple: since AR order better preserves reasoning potential, we just train dLLMs with standard GRPO in AR mode. No bells and whistles.

Results

JustGRPO achieves state-of-the-art performance across reasoning and coding benchmarks:

[Figure: Accuracy Comparison]

Simplicity

Existing RL methods for dLLMs often require handling the complexity of arbitrary-order generation:

  • Combinatorial trajectories: optimizing over factorial-sized denoising paths
  • Intractable likelihoods: ELBO-based surrogates instead of the true objective
  • Sampler-learner mismatch: confidence-based samplers vs. the original diffusion prior
  • JustGRPO sidesteps all of this by treating dLLMs as autoregressive models during RL training. The result? Standard GRPO, directly applicable, with exact likelihood computation.
  • The core logic of JustGRPO (grpo.py) fits in ~60 lines: rollout sampling and log-probability loss computation. That's it.

💡 The model still retains parallel decoding at inference time — we only use AR order during training. See our paper for more details.
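For intuition, here is a minimal sketch of a GRPO objective of this shape (a sketch only, assuming group-normalized rewards and a PPO-style clipped ratio; the grpo_loss name, tensor layout, and the 1e-4 normalizer are illustrative assumptions, not grpo.py itself):

import torch

def grpo_loss(logprobs, old_logprobs, rewards, group_size, clip_eps=0.2):
    # logprobs, old_logprobs: (B, T) per-token log-probs of the sampled
    # completion tokens under the current and the rollout policy, where
    # B = num_prompts * group_size completions.
    # rewards: (B,) scalar reward per completion (e.g. answer correctness).

    # Group-relative advantage: normalize each completion's reward against
    # the other completions sampled for the same prompt.
    r = rewards.view(-1, group_size)
    adv = (r - r.mean(dim=1, keepdim=True)) / (r.std(dim=1, keepdim=True) + 1e-4)
    adv = adv.view(-1, 1)                            # broadcast over tokens

    # PPO-style clipped surrogate, computed with exact AR likelihoods.
    ratio = torch.exp(logprobs - old_logprobs)       # (B, T)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.minimum(unclipped, clipped).mean()

Padding masks, any KL term, and the rollout loop are omitted here; see grpo.py for the actual implementation.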

Installation

JustGRPO is designed to be lightweight and dependency-minimal.

git clone https://github.com/LeapLabTHU/JustGRPO.git
cd JustGRPO
pip install -r requirements.txt

Dependencies:

  • accelerate
  • transformers
  • datasets
  • Standard evaluation utilities (sympy, latex2sympy2, etc.)

Usage

We provide training and evaluation code on GSM8K. The RL-trained model is available on Hugging Face.

Training

accelerate launch --num_processes 8 --main_process_ip localhost --config_file configs/fsdp.yaml train.py \
  --run_dir ./checkpoints \
  --grad_accum 8

Note: Keep global batch size = num_gpus × grad_accum = 64.

Adjust --grad_accum based on your GPU count (e.g., 16 GPUs → --grad_accum 4).

Evaluation

torchrun --standalone --nproc-per-node=8 eval.py \
  --ckpt_path /path/to/ckpt \
  --steps 256 \
  --gen_length 256 \
  --block_length 32
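As a rough guide to these flags (an assumption based on typical LLaDA-style semi-autoregressive block decoding; check eval.py for the exact semantics), the generation length is split into blocks and the denoising steps are divided evenly across them:

steps, gen_length, block_length = 256, 256, 32     # values from the command above

num_blocks      = gen_length // block_length       # 8 blocks, decoded left to right
steps_per_block = steps // num_blocks              # 32 denoising steps per block
tokens_per_step = block_length // steps_per_block  # ~1 token unmasked per step
print(num_blocks, steps_per_block, tokens_per_step)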

Citation

If you find this work useful, please cite:

@article{ni2026flexibility,
  title={The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models},
  author={Ni, Zanlin and Wang, Shenzhi and Yue, Yang and Yu, Tianyu and Zhao, Weilin and Hua, Yeguo and Chen, Tianyi and Song, Jun and Yu, Cheng and Zheng, Bo and Huang, Gao},
  journal={arXiv preprint arXiv:2601.15165},
  year={2026}
}

Acknowledgments

This project builds upon the following excellent works:

We thank the authors for their open-source contributions to the community.
