ParaVT

Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Overview

Long-video understanding is increasingly framed as agentic video reasoning: a large multimodal model (LMM) is post-trained with reinforcement learning to invoke video-processing tools. Prior work in this line, including our earlier LongVT (CVPR 2026), dispatches tool calls sequentially, which makes it brittle to a single mis-localization, prone to multi-turn context drift, and linear in cost with the number of calls.

Figure: ParaVT architecture. Sequential one-tool-per-turn (left) vs. parallel single-turn dispatch with weight-sharing sub-agents (right).

ParaVT is the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling: a main agent issues multiple temporal-window crops in a single turn, dispatches them to weight-sharing sub-agents, and aggregates the parallel evidence into a final answer.
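The single-turn fan-out described above can be pictured with a minimal Python sketch. It is illustrative only and not the repository's implementation: `Window`, `sub_agent`, `main_agent_turn`, and `aggregate` are hypothetical names, and the sub-agent here returns a placeholder string where a real one would run the shared-weight LMM on the cropped clip.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """A temporal crop request, in seconds (hypothetical structure)."""
    start: float
    end: float


def sub_agent(question: str, window: Window) -> str:
    """Stand-in for a weight-sharing sub-agent inspecting one crop."""
    return f"evidence for '{question}' in [{window.start:.0f}s, {window.end:.0f}s]"


def main_agent_turn(question: str, windows: list[Window]) -> list[str]:
    """Dispatch all windows in one turn and collect evidence in request order."""
    with ThreadPoolExecutor(max_workers=max(1, len(windows))) as pool:
        # pool.map preserves input order, so evidence[i] belongs to windows[i].
        return list(pool.map(lambda w: sub_agent(question, w), windows))


def aggregate(evidence: list[str]) -> str:
    """Placeholder aggregation; the main agent would reason over this text."""
    return "\n".join(evidence)


windows = [Window(0, 60), Window(300, 360), Window(900, 960)]
evidence = main_agent_turn("When does the goal happen?", windows)
print(aggregate(evidence))
```

Because every window is requested in the same turn, one badly placed crop does not block the others, which is the contrast with the sequential one-tool-per-turn setup.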

Figure: Two failure modes of vanilla GRPO on a tool-native LMM. Format Fragility (left) and Tool Necessity Gap (right).

Standard GRPO on a tool-native LMM surfaces two coupled failures driven by the same pretrained tool prior — Format Fragility (left, the SFT-learned structural tags collapse under temperature sampling) and the Tool Necessity Gap (right, the skip-tool reward shortcut). We name this trade-off the Tool Prior Paradox and tame it with PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO): a targeted format reward at the structural tokens most prone to collapse, paired with a per-prompt frame-budget randomization that lets calling the tool earn measurable RL credit.
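As a rough illustration of the two ingredients, the sketch below pairs a parseability check with a per-prompt frame-budget draw. Everything specific in it is an assumption made for the example: the tag names `tool_call` and `answer`, the binary reward values, and the candidate budgets `(8, 16, 32, 64)` are not taken from the paper or from the repository's YAML config, which remain the authoritative source.

```python
import random
import re

# Assumed structural tags; the real tags come from the cold-start SFT data.
TAGS = ("tool_call", "answer")


def format_reward(response: str) -> float:
    """Return 1.0 when the structural tags parse cleanly, else 0.0."""
    complete_blocks = 0
    for tag in TAGS:
        # Unbalanced open/close counts mean the structure has collapsed.
        if response.count(f"<{tag}>") != response.count(f"</{tag}>"):
            return 0.0
        complete_blocks += len(
            re.findall(rf"<{tag}>.*?</{tag}>", response, re.DOTALL)
        )
    # Require at least one well-formed block so plain text earns nothing.
    return 1.0 if complete_blocks > 0 else 0.0


def assign_frame_budgets(prompt_ids, group_size, choices=(8, 16, 32, 64), seed=0):
    """Draw one frame budget per prompt and share it across its rollout group."""
    rng = random.Random(seed)
    budgets = {}
    for pid in prompt_ids:
        budget = rng.choice(choices)  # single draw per prompt, not per rollout
        budgets[pid] = [budget] * group_size
    return budgets


budgets = assign_frame_budgets(["p0", "p1", "p2"], group_size=4)
print(format_reward("<tool_call>{\"start\": 0, \"end\": 60}</tool_call>"))
print(budgets)
```

Sharing one budget across a prompt's rollout group keeps the within-group comparison fair: rollouts differ in whether and how they call the tool, not in how many frames they were given.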

Prerequisites

Data & Weight

| Asset | HuggingFace | Used by |
|---|---|---|
| Media Files | ParaVT/ParaVT-Source | Source videos + auxiliary images referenced by every parquet row |
| Training Data | ParaVT/ParaVT-Parquet | Annotations for both cold-start SFT and agentic RL |
| Checkpoint | ParaVT/ParaVT-8B | Model weights after agentic post-training with PARA-GRPO |

After downloading both datasets, run python -m paravt.data.materialize once to convert the parquets' sentinel paths into local file:// URIs; see paravt/data/README.md for the round-trip flow.
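To give a feel for what such a materialization step does, here is a minimal sketch of rewriting a sentinel-prefixed path into a local file:// URI. The `SENTINEL://` prefix, the `to_file_uri` helper, and the example media root are all made up for illustration; the actual sentinel format and the round-trip flow are described in paravt/data/README.md.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

# Hypothetical sentinel prefix; the real one is defined by the dataset.
SENTINEL = "SENTINEL://"


def to_file_uri(path: str, media_root: str) -> str:
    """Rewrite a sentinel-prefixed path to a file:// URI under media_root."""
    if not path.startswith(SENTINEL):
        return path  # already materialized or not a media reference
    relative = path[len(SENTINEL):].lstrip("/")
    local = PurePosixPath(media_root) / relative
    # Percent-encode spaces and other unsafe characters, keeping "/" intact.
    return "file://" + quote(str(local))


uri = to_file_uri("SENTINEL://videos/clip 01.mp4", "/data/paravt-source")
print(uri)
```

The helper leaves non-sentinel values untouched, so running it twice over the same rows is harmless.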

Environment

Python 3.10–3.12 (uv handles installation); CUDA 12.6 toolchain matching the pinned torch==2.9.1+cu126; cuDNN ≥ 9.15 (we pin 9.16.0.29 — older cuDNN triggers a known PyTorch nn.Conv3d perf regression that hangs SGLang's multimodal rollout for tens of minutes); at least 8 × NVIDIA GPUs with ≥ 80 GB VRAM each.
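A quick sanity check of these requirements might look like the sketch below. The helper names are hypothetical and this script is not shipped with the repository. It assumes the cuDNN 9 integer encoding (major × 10000 + minor × 100 + patch) returned by `torch.backends.cudnn.version()`, and it does not check per-GPU memory.

```python
import sys


def python_ok(version=sys.version_info) -> bool:
    """True for Python 3.10 through 3.12."""
    return (3, 10) <= tuple(version[:2]) <= (3, 12)


def cudnn_ok(version_int: int, minimum=(9, 15)) -> bool:
    """True when the cuDNN version integer is at least `minimum`.

    Assumes the cuDNN 9 encoding: major * 10000 + minor * 100 + patch.
    Older cuDNN 8 integers (for example 8902) decode to major 0 and fail.
    """
    major = version_int // 10000
    minor = (version_int % 10000) // 100
    return (major, minor) >= minimum


def check_environment() -> dict:
    """Collect pass/fail flags; imports torch lazily so the helpers stay usable."""
    import torch

    return {
        "python": python_ok(),
        "torch": torch.__version__.startswith("2.9.1"),
        "cudnn": cudnn_ok(torch.backends.cudnn.version() or 0),
        "gpus": torch.cuda.device_count() >= 8,
    }


print(cudnn_ok(91600))  # 9.16.0 encoded as an integer
```

Calling `check_environment()` inside one of the installed venvs returns a dictionary of booleans, so a failing entry points at the requirement to revisit.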

We ship three isolated uv virtual environments — one per workload (sft, rl, eval) — so each vendored framework can pin its own torch / transformers / sglang versions without conflict. Lock files live in requirements/; see requirements/README.md for how the three venvs are assembled.

git clone https://github.com/EvolvingLMMs-Lab/ParaVT.git
cd ParaVT
cp .secrets.env.example .secrets.env && $EDITOR .secrets.env   # HF_TOKEN, WANDB_API_KEY, paths

bash scripts/setup_env.sh all          # or sft / rl / eval to install just one

Cold-Start SFT

Cold-start supervised fine-tuning produces the tool-aware checkpoint that initializes the agentic RL stage. The recipe vendors lmms-engine at paravt/sft/lmms-engine/; diffs we maintain on top of upstream sit in patches/lmms-engine/. Full config explanations, data-format notes, and 4-GPU smoke-test recipes live in paravt/sft/README.md.

bash scripts/run_sft.sh                # full 8-GPU cold-start run

Agentic RL

PARA-GRPO post-trains the cold-start checkpoint via the vendored AReaL framework at paravt/rl/areal/ (patches at patches/areal/). Each PARA-GRPO knob (Exploration Anchoring strength, nFrames Gating distribution, reward weights, subagent dispatch) is exposed in the YAML config. Per-knob explanations, vanilla-GRPO comparison, and the reward-module entry points are in paravt/rl/README.md.

bash scripts/run_rl.sh                 # full 7-GPU FSDP + 1-GPU SGLang rollout

Evaluation

The eval driver (paravt.eval.driver) reproduces every headline number from the paper across the seven splits (VideoMME w/o sub, VideoMME w/ sub, LongVideoBench, LVBench, MLVU, MMVU, Charades-STA test). Vendored lmms-eval sits at paravt/eval/lmms-eval/. Per-row reproduce scripts and the with-tool vs. no-tool protocol details are in paravt/eval/README.md.

PARAVT_EVAL_MODEL=ParaVT/ParaVT-8B \
    bash paravt/eval/scripts/reproduce_paravt_8b.sh

Acknowledgements

ParaVT builds on three open-source frameworks: lmms-engine for SFT, AReaL for RL, and lmms-eval for evaluation. We thank their authors and maintainers, and welcome pull requests, issues, and discussions from the community.

Citation

If you find this project helpful, please consider citing our paper:

@misc{yang2026paravt,
  title={{ParaVT}: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning},
  author={Zuhao Yang and Kaichen Zhang and Sudong Wang and Keming Wu and Zhongyu Yang
          and Bo Li and Xiaojuan Qi and Shijian Lu and Xingxuan Li and Lidong Bing},
  year={2026},
  eprint={2605.20342},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
