CodeVision is a framework that leverages code-as-tool and comprehensive SFT/RL datasets to enable "thinking with images". It provides a unified view for reasoning over visual information through programming vision.
- Multi-turn Agent Loops: Supports complex interactions and reasoning chains for the Qwen2.5-VL and Qwen3-VL series (sketched below).
- Comprehensive Datasets: Includes high-quality SFT datasets (constructed via GPT-5-High) and RL datasets covering diverse domains.
- Thinking with Images: Enables models to process and reason about visual content programmatically.
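The loop is simple in outline: the model emits code, the code runs against the image, and the result is fed back as a new turn. The sketch below is a minimal illustration of that code-as-tool cycle; `vlm_generate`, `run_sandboxed`, and the `<code>` tag format are hypothetical stand-ins, not CodeVision's actual API.

```python
import re

def agent_loop(image, question, vlm_generate, run_sandboxed, max_turns=5):
    """Minimal multi-turn code-as-tool loop (illustrative only)."""
    messages = [{"role": "user", "content": [image, question]}]
    reply = ""
    for _ in range(max_turns):
        reply = vlm_generate(messages)                 # model may emit a <code> block
        messages.append({"role": "assistant", "content": reply})
        match = re.search(r"<code>(.*?)</code>", reply, re.DOTALL)
        if match is None:                              # no tool call -> final answer
            break
        result = run_sandboxed(match.group(1), image)  # e.g. crop, zoom, annotate
        messages.append({"role": "user", "content": f"<result>{result}</result>"})
    return reply
```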
These examples demonstrate the agent's ability to perform multi-turn reasoning and emergent tool usage.
| Case 1 | Case 2 |
|---|---|
| *(demo image omitted)* | *(demo image omitted)* |
Install the required dependencies. You need to choose between vLLM and SGLang for the inference backend.

```bash
pip install "torch==2.8.0" "torchvision==0.23.0"

# vllm >= 0.11.0 or sglang >= 0.5.3 for Qwen3-VL series support
# Pick one stack: vLLM OR SGLang (install the one you need)
pip install vllm==0.11.0          # option 1: vLLM stack
pip install "sglang[all]==0.5.3"  # option 2: SGLang stack

# transformers >= 4.57.0 for Qwen3-VL series support
pip install transformers==4.57.0

# FlashAttention
pip install --no-cache-dir --use-pep517 flash-attn==2.8.3 --no-build-isolation

# Other dependencies
pip install -r requirements-runtime.txt
```
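A quick, optional sanity check confirms the pinned versions resolved correctly (plain Python, no CodeVision code involved):

```python
# Optional environment check against the versions pinned above.
import torch
import torchvision
import transformers

print(torch.__version__)          # expect 2.8.0
print(torchvision.__version__)    # expect 0.23.0
print(transformers.__version__)   # expect 4.57.0
print(torch.cuda.is_available())  # should be True on a GPU machine
```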
The SFT process uses LLaMA-Factory.

- **Prepare Data**: Download the CodeVision-SFT Dataset.
- **Configure**:
  - Update `LLaMA-Factory/data/dataset_info.json` with your data path (a record sketch follows this list).
  - Review the config files: `LLaMA-Factory/examples/train_full/qwen2_5vl_full_sft.yaml` or `LLaMA-Factory/examples/train_full/qwen3vl.yaml`.
- **Train**:

  ```bash
  cd LLaMA-Factory
  pip install -e ".[torch,metrics]" --no-build-isolation

  # Example for Qwen3-VL Full SFT
  FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/qwen3vl.yaml
  ```
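For orientation, a single record in LLaMA-Factory's default sharegpt multimodal layout looks roughly like the following. The field names (`conversations`, `from`, `value`, `images`) follow LLaMA-Factory conventions; the actual CodeVision-SFT schema is whatever your `dataset_info.json` entry declares, so treat this as a sketch.

```python
# Hypothetical SFT record in LLaMA-Factory's sharegpt multimodal layout.
# The real CodeVision-SFT schema may differ; check dataset_info.json.
sample = {
    "conversations": [
        {"from": "human", "value": "<image>How many red blocks are there?"},
        {"from": "gpt", "value": "<code>...</code> There are three red blocks."},
    ],
    "images": ["images/example_000.png"],  # one path per <image> token
}
```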
Once you have the SFT model, proceed to RL training.
- **Prepare Data**: Download the CodeVision-RL Dataset.
- **Deploy Judge**: Start the LLM judge server (required for reward/evaluation; see the query sketch after this list).

  ```bash
  vllm serve Qwen3-235B-A22B-Instruct-2507 \
      --port 18901 \
      --host :: \
      --gpu-memory-utilization 0.8 \
      --max-model-len 32768 \
      --tensor-parallel-size 8 \
      --trust-remote-code \
      --disable-log-requests
  ```

- **Train**: Update the configuration in `recipe/codevision/qwen3_vl.sh` (e.g., set `MODEL_PATH`, `LLM_JUDGE`, and `train_files`...) and run:

  ```bash
  bash recipe/codevision/qwen3_vl.sh
  ```
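Because `vllm serve` exposes an OpenAI-compatible endpoint, you can smoke-test the judge with any OpenAI client before launching training. The prompt here is purely illustrative, not CodeVision's actual reward prompt:

```python
# Smoke test for the judge server started above (OpenAI-compatible API).
# The prompt is illustrative; the real reward prompt lives in the recipe.
from openai import OpenAI

judge = OpenAI(base_url="http://localhost:18901/v1", api_key="EMPTY")
resp = judge.chat.completions.create(
    model="Qwen3-235B-A22B-Instruct-2507",
    messages=[{
        "role": "user",
        "content": (
            "Question: ...\nReference answer: ...\nModel answer: ...\n"
            "Reply 1 if the model answer matches the reference, else 0."
        ),
    }],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```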
To evaluate your model on benchmarks:
- Edit `recipe/codevision/eval.sh` to include your target benchmarks (update `test_files` and `LLM_JUDGE`).
- Run the evaluation script:

  ```bash
  bash recipe/codevision/eval.sh
  ```
If you find this work useful, please cite our paper:
```bibtex
@article{guo2025thinking,
  title={Thinking with Programming Vision: Towards a Unified View for Thinking with Images},
  author={Guo, Zirun and Hong, Minjie and Zhang, Feng and Jia, Kai and Jin, Tao},
  journal={arXiv preprint arXiv:2512.03746},
  year={2025}
}
```



