Hao Chen, Jiaming Liu, Zhonghao Yan, Nuowei Han, Renrui Zhang, Chenyang Gu, Jialin Gao, Ziyu Guo, Siyuan Qian, Yinxi Wang, Peng Jia, Shanghang Zhang, Pheng-Ann Heng
🤖The Framework of LaST-R1. (a) LaST-R1 VLA is a unified model that takes visual observations and language instructions as input, where a vision foundation model provides physically grounded latent targets to guide latent CoT reasoning before action generation. (b) During LAPO RL post-training, LaST-R1 interacts with the environment in a closed loop manner, storing latents, actions, and rewards in a rollout buffer for jointly reshaping the latent and action spaces. It further enables adaptive reasoning by learning to emit the <latent_end> token based on predicted probabilities, dynamically adjusting the reasoning horizon across tasks. (c) Through LAPO, LaST-R1 achieves adaptive reasoning lengths across diverse tasks, improving generalization and execution stability.
- [2026/05/06] LaST-R1 is now live on arXiv! The code and model checkpoints for both simulation SFT and RL are publicly available! 🚀
The code is built using Python 3.10, we also recommand to use Python above Python 3.10. We require PyTorch >= 2.2.0 and CUDA >= 12.0 (It may run with lower versions, but we have not tested it). We recommend using Miniconda and create an environment as follows:
conda create -n last-r1 python=3.10 -y
conda activate last-r1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# Clone veRL (recommended to place at the same level as LaST-R1, not inside the LaST-R1 folder)
git clone -b v0.2.x https://github.com/volcengine/verl.git
cd /path/to/verl
# Replace the installed pyproject.toml file with our custom ./setup/pyproject.toml file.
pip install -e .
# Clone LIBERO (recommended to place at the same level as LaST-R1, not inside the LaST-R1 folder)
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd /path/to/LIBERO
pip install -e .Our code is built based on Qwen3-VL and veRL, organized in the following framework:
verl/trainer/main_ppo.py: LAPO training entry point that initializes the trainer and launches the training pipelineverl/trainer/config/ppo_trainer.yaml: default LAPO training configuration (data, optimization, rollout, logging, and runtime settings)verl/trainer/ppo/ray_trainer.py: coordinates distributed training, including rollout, reward/advantage computation, and actor updatesverl/trainer/ppo/core_algos.py: core LAPO utilities and algorithm logic used by the trainer (e.g., advantage/return and policy optimization helpers)verl/workers/rollout/rob_rollout.py: handles environment interaction, trajectory collection, and action generation during rolloutverl/workers/fsdp_workers.py: defines FSDP-based worker abstractions for distributed model execution and training/inference worker behaviorsverl/workers/actor/dp_rob.py: actor-side training logic, including loss computation, value prediction, and policy updatesverl/workers/actor/action_tokenizer.py: converts continuous robot actions to discrete tokens and backtransformers/models/qwen3_vl/modeling_qwen3_vl.py: core Qwen3-VL model implementation, including latent/action modeling and value head integrationtransformers/integrations/sdpa_attention.py: SDPA attention integration and attention-mask related logic
We would like to thank SimpleVLA-RL 🥳, upon which our repo is built.
We release all LIBERO warmup SFT models and post-RL models on Huggingface 🤗 as follows:
- last-r1-warmup-libero_spatial-oneshot | last-r1-rl-libero_spatial
- last-r1-warmup-libero_object-oneshot | last-r1-rl-libero_object
- last-r1-warmup-libero_goal-oneshot | last-r1-rl-libero_goal
- last-r1-warmup-libero_10-oneshot | last-r1-rl-libero_10
-
The main training and evaluation script is
scripts/run_libero_rl_training.sh. When editing this script, please pay attention to:SFT_MODEL_PATH(warm-up model checkpoint path),DATA_STATUS(dataset statistics.jsonused for action normalization), andALIGN_PATH(runtime environmentalign.jsonconfig) must be valid paths.VAL_ONLYcontrols train/eval mode: setVAL_ONLY=Falsefor training andVAL_ONLY=Truefor evaluation.DATASET_NAMEshould match your benchmark split (libero_spatial,libero_object,libero_goal, orlibero_10), and related parameters (e.g.,max_prompt_length,traj_mini_batch_size, and*_max_steps) should be updated consistently.NUM_GPUSaffects several effective batch settings (e.g.,actor.ppo_micro_batch_size=$NUM_GPUSand recommendedval_batch_size); if you change GPU count, adjust batch sizes accordingly to avoid OOM or shape mismatches.
-
Hardware-specific initialization note: in
verl/workers/rollout/rob_rollout.py, functionenv_worker(...)(around lines 315-316), we applyinitial_state[12] += 0.038forlibero_spatialtask 5 to avoid environment initialization failure observed on NVIDIA H20 machines. If your hardware does not show this issue, you may remove this workaround based on your setup. -
We evaluate LaST-R1 on LIBERO and achieve state-of-the-art performance.
This project is licensed under the MIT License - see the LICENSE file for details.
@article{chen2026last,
title = {LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models},
author = {Chen, Hao and Liu, Jiaming and Yan, Zhonghao and Han, Nuowei and Zhang, Renrui and Gu, Chenyang and Gao, Jialin and Guo, Ziyu and Qian, Siyuan and Wang, Yinxi and others},
journal = {arXiv preprint arXiv:2604.28192},
year = {2026}
}
