
Math-VR Benchmark & CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images


HKU-MMLab/Math-VR-CodePlot-CoT



Chengqi Duan1*, Kaiyue Sun1*, Rongyao Fang3*, Manyuan Zhang2†, Yan Feng2, Ying Luo2, Yufang Liu2, Ke Wang3, Peng Pei2, Xunliang Cai2, Hongsheng Li3, Yi Ma1, Xihui Liu1 ✉️

1HKU, 2Meituan, 3CUHK

*Equal contribution, †Project Lead, ✉️Corresponding author

 

Paper • Introduction • Math-VR • Model • Usage • Evaluation • Benchmark results • License • Citation

Introduction

Recent advances in Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet they still face a critical bottleneck with problems that require visual assistance, such as drawing auxiliary lines or plotting functions. Most VLMs are constrained to text-only reasoning, while unified models that generate interleaved text and images often lack the precision required for mathematical tasks.

We present CodePlot-CoT, a code-driven Chain-of-Thought (CoT) paradigm that enables models to "think with images" in mathematics. Our approach leverages a VLM to generate both textual reasoning and executable plotting code. This code is then rendered into an image, which serves as a "visual thought" fed back into the model to aid problem solving. To facilitate this, we introduce Math-VR, the first large-scale, bilingual dataset and benchmark for mathematical problems requiring visual reasoning, comprising 178K samples. We also develop MatplotCode, a specialized image-to-code converter used to generate high-quality training data. We benchmark SOTA models on Math-VR. Our experiments show that CodePlot-CoT achieves up to a 21% performance increase over its base model, demonstrating the effectiveness of our code-driven reasoning paradigm.

The main contributions of our work can be summarized as follows:

  • We propose a novel and efficient paradigm that enables VLMs to engage in visual reasoning through code generation.
  • We construct Math-VR, the first large-scale, bilingual dataset and benchmark (178K samples) for Mathematical problems with Visual Reasoning.
  • We develop MatplotCode, a state-of-the-art image-to-code converter for mathematical figures, and train CodePlot-CoT model, a specialized model that achieves up to a 21% performance increase over strong baselines.

Released Data: Math-VR-train and Math-VR-bench

| Dataset | Link |
| --- | --- |
| Math-VR-train | 🤗 HuggingFace |
| Math-VR-bench | 🤗 HuggingFace |

Released Model: MatPlotCode and CodePlot-CoT

| Model | Link |
| --- | --- |
| MatPlotCode | 🤗 HuggingFace |
| CodePlot-CoT | 🤗 HuggingFace |

Math-VR

Math-VR is the first large-scale, bilingual (English and Chinese) dataset and benchmark specifically designed to evaluate and advance the visual reasoning capabilities of AI models in mathematics. While traditional benchmarks have focused on text-centric problem-solving, Math-VR targets the critical domain of problems that require "reasoning with images," such as drawing auxiliary lines or plotting functions to find a solution.

The Math-VR dataset contains 178,000 samples, each consisting of a question, a detailed reasoning process, and a final answer. A key feature of this dataset is that the reasoning process for each problem includes at least one image, providing a rich resource for training models to integrate visual information into their problem-solving steps. The dataset spans multiple mathematical domains, including Geometry, Algebra, and Calculus.

The Math-VR benchmark consists of 5,000 bilingual (English and Chinese) mathematical questions. To ensure deterministic and reliable evaluation, questions were carefully selected; for instance, proof-based questions were excluded to avoid the difficulty of assessing logical validity, and most multiple-choice questions were removed to prevent correct answers obtained by random guessing. The benchmark is divided into two subsets: a Text subset with 2,000 text-only questions and a Multimodal subset with 3,000 questions presented with both text and images. Both question types require models to reason or use imagination in the visual domain. We designed a comprehensive evaluation pipeline that uses two core metrics to measure a model's performance:

  • Answer Correctness (AC): This metric provides a reliable binary judgment by strictly checking whether the model's final answer perfectly matches the ground-truth answer. Any error or omission results in a score of 0.
  • Process Score (PS): Recognizing that the reasoning process can be valuable even if the final answer is incorrect, this metric awards partial credit. It assesses whether the model hits critical "scoring points"—such as applying theorems or performing necessary calculations—within its reasoning steps. This fine-grained assessment more accurately reflects a model's true problem-solving abilities.
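Mechanically, the two metrics behave as sketched below. This is only an illustrative stand-in: in the real pipeline both judgments are delegated to an LLM judge, so the exact-match and set-intersection logic here is an assumption for demonstration, not the official scoring code.

```python
def answer_correctness(model_answer: str, ground_truth: str) -> int:
    """AC: strict binary judgment; credit only when the final answer
    perfectly matches the ground truth, otherwise 0."""
    return int(model_answer.strip() == ground_truth.strip())

def process_score(hit_points: set, scoring_points: set) -> float:
    """PS: partial credit as the fraction of critical scoring points
    (theorems applied, necessary calculations) hit by the reasoning."""
    if not scoring_points:
        return 0.0
    return len(hit_points & scoring_points) / len(scoring_points)
```

A model that misses the final answer but applies one of two required theorems would score AC = 0 but PS = 0.5, which is why PS more accurately reflects partial problem-solving ability.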

Model Overview

CodePlot-CoT: Mathematical Visual Reasoning with Code-Driven Images

We introduce CodePlot-CoT, a code-driven Chain-of-Thought (CoT) paradigm designed to enable Vision Language Models to "think with images" when solving mathematical problems. Rather than generating pixel-based images directly, the model outputs executable plotting code to represent its "visual thoughts". This code is executed to render a precise figure, which is then fed back into the model as a visual input for subsequent reasoning steps.
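One step of this loop can be sketched as follows. The `<plot>` delimiter and the function name are assumptions for illustration; the actual format in which the model emits plotting code is defined by its training data, not by this sketch.

```python
import re

# Hypothetical delimiter wrapping the model's embedded plotting code.
PLOT_BLOCK = re.compile(r"<plot>(.*?)</plot>", re.S)

def split_visual_thought(response: str):
    """Split one model response into textual reasoning and the first
    embedded plotting-code snippet (None if the step is text-only).

    In the full loop, the snippet is executed, the rendered figure is
    saved to disk, and the image is appended to the conversation as a
    visual input for the next reasoning step."""
    match = PLOT_BLOCK.search(response)
    code = match.group(1).strip() if match else None
    text = PLOT_BLOCK.sub("", response).strip()
    return text, code
```

The key design choice is that the "visual thought" is code, not pixels: rendering it with a plotting library guarantees the geometric precision that direct image generation lacks.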

MatplotCode: A High-Fidelity Converter for Mathematical Figures

To train the CodePlot-CoT model, we require high-quality data pairing images with corresponding plotting code. Since such resources are rare and existing general models are unreliable for this specialized task, we developed MatplotCode, a state-of-the-art image-to-code converter designed specifically for mathematical figures. It specializes in converting complex mathematical figures into high-fidelity Python plotting code. In our evaluation, MatplotCode achieves a 100% code execution success rate, and its image reconstruction fidelity is significantly higher than that of SOTA models including GPT-o3 and Gemini-2.5-Pro. MatplotCode is the key to enabling the large-scale curation of our code-driven training data, laying the foundation for the successful training of the CodePlot-CoT model.
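An execution success rate like the one reported above can be measured with a simple harness such as the following. This is a sketch under the assumption that each generated program is a standalone Python script; it is not the project's actual evaluation code.

```python
import os
import subprocess
import sys
import tempfile

def execution_success_rate(code_samples, timeout=30):
    """Fraction of generated plotting programs that run to completion.

    Each program is executed in a fresh subprocess so that one crash
    (or an infinite loop, caught via the timeout) cannot affect the
    other samples."""
    if not code_samples:
        return 0.0
    succeeded = 0
    for code in code_samples:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            succeeded += result.returncode == 0
        except subprocess.TimeoutExpired:
            pass  # a hung program counts as a failure
        finally:
            os.unlink(path)
    return succeeded / len(code_samples)
```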

Usage

Installation

Clone the repo and install the required packages.

conda create -n codeplot python=3.10
conda activate codeplot
git clone git@github.com:HKU-MMLab/Math-VR-CodePlot-CoT.git
cd Math-VR-CodePlot-CoT
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1

For benchmark evaluation only, additionally install:

pip install openai==4.1.1
pip install datasets==2.0.0

Model Weights

Expected directory structure might be:

CodePlot-CoT
├── ckpts
│   ├── CodePlot-CoT 
│   ├── MatPlotCode 
├── ...

Inference

# Convert image to python code with MatPlotCode
python image_to_code.py
# Solve math problems with CodePlot-CoT
python math_infer.py

Math-VR Benchmark Evaluation

To evaluate a model on the Math-VR benchmark, please follow these steps:

  1. Download the Math-VR benchmark dataset from Hugging Face. This dataset contains the 2,500 English test questions.
  2. Store the downloaded file in a data/ directory.
  3. Create a JSON file containing the model's solutions.
    • Please refer to Math-VR-Infer.py for guidance on the generation process. If you only wish to evaluate on the text subset or the multimodal subset, pass --type text or --type multimodal to the inference script.
    • The answer file must be a JSON object mapping each <question_id> to the model's corresponding solution string. An example is provided in examples/answer.json.

The required format for the answer file is as follows:

{
  "<question_id>": "<Model's solution>",
  ...
}
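For example, such a file can be written with a few lines of Python (the helper name and the sample entries are illustrative; any code that emits the same JSON mapping works):

```python
import json

def write_answer_file(solutions, path):
    """Serialize a {question_id: solution} mapping in the format
    expected by Math-VR-Eval.py."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(solutions, f, ensure_ascii=False, indent=2)
```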
  4. Run the evaluation script with the following command:
python Math-VR-Eval.py --answer_dir "<JSON file containing the model's solutions>" --result_dir "<path to save the evaluation result in JSON format>" --data_path "<path to the benchmark dataset>" --api_key "<your OpenAI API key>"

The script leverages GPT-4.1 to evaluate the model's responses and generates a result.json containing its judgments.

  5. Summarize all scores by running:

python summarize_score.py --result_dir "<The path to the saved result>" --data_path "<The path to the benchmark dataset>"

Benchmark

The leaderboard is available here. We benchmark a suite of SOTA VLMs (Vision Language Models), UMs (Unified Models), and LLMs (Large Language Models) on the English subset of the Math-VR benchmark, which contains 2,500 unique questions.

Math-VR benchmark (English) on VLMs and UMs
| # | Model | Link | Version | #Params | Type | Thinking | Overall (AC) | Overall (PS) | Text (AC) | Text (PS) | Multimodal (AC) | Multimodal (PS) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Qwen3-VL-235B-A22B-Thinking 🥇 | Link | | 235B | VLM | | 66.8 | 81.0 | 58.9 | 77.4 | 72.1 | 83.4 |
| 2 | Qwen3-VL-235B-A22B-Instruct 🥈 | Link | | 235B | VLM | X | 65.0 | 80.1 | 59.4 | 77.8 | 68.8 | 81.6 |
| 3 | Gemini-2.5-Pro 🥉 | Link | | | VLM | | 64.7 | 80.8 | 58.7 | 77.9 | 68.7 | 82.8 |
| 4 | Gemini-2.5-Flash | Link | 2025-06-17 | | VLM | | 60.5 | 78.4 | 57.0 | 77.5 | 62.9 | 79.0 |
| 5 | GPT-o3 | Link | 2025-04-16 | | VLM | | 59.3 | 76.4 | 52.9 | 72.9 | 63.7 | 78.6 |
| 6 | Seed-1.6-Thinking | Link | 2025-06-15 | | VLM | | 58.4 | 75.2 | 53.0 | 73.0 | 62.0 | 76.6 |
| 7 | GPT-5-Thinking | Link | | | VLM | | 58.1 | 70.6 | 53.2 | 68.0 | 61.4 | 72.3 |
| 8 | Claude Opus4.1 | Link | | | VLM | | 54.3 | 70.6 | 53.1 | 70.5 | 55.1 | 70.6 |
| 9 | Nano Banana | Link | 2025-08-26 | | UM | X | 53.4 | 73.8 | 49.1 | 72.3 | 56.3 | 74.7 |
| 10 | Gemini-2.5-Flash-No-Thinking | Link | 2025-06-17 | | VLM | X | 52.3 | 73.7 | 44.6 | 70.9 | 57.5 | 75.5 |
| 11 | GLM-4.5V | Link | | 108B | VLM | | 49.6 | 69.7 | 48.0 | 70.5 | 50.6 | 69.1 |
| 12 | Mimo-VL-7B-RL | Link | 2508 | 7B | VLM | | 48.3 | 68.8 | 43.5 | 68.4 | 51.3 | 69.0 |
| 13 | InternVL-3.5-8B | Link | | 8B | VLM | | 40.8 | 62.8 | 38.5 | 64.0 | 42.2 | 62.0 |
| 14 | GPT-4.1-mini | Link | | | VLM | X | 33.3 | 60.0 | 33.3 | 62.0 | 33.3 | 58.6 |
| 15 | GLM-4.1V-9B | Link | | 9B | VLM | | 29.0 | 53.4 | 27.8 | 54.4 | 29.9 | 52.7 |
| 16 | Claude-Sonnet-4 | Link | 2025-05-23 | | VLM | X | 28.1 | 56.4 | 31.5 | 60.9 | 25.8 | 53.4 |
| 17 | GPT-4.1 | Link | | | VLM | X | 26.0 | 53.9 | 26.6 | 56.5 | 25.6 | 52.2 |
| 18 | CodePlot-CoT | Link | | 32B | VLM | X | 22.1 | 47.0 | 31.6 | 53.8 | 15.8 | 42.4 |
| 19 | Gemini-2.0-Flash | Link | | | VLM | X | 20.6 | 50.7 | 24.1 | 56.1 | 18.3 | 47.0 |
| 20 | Keye-VL-1.5 | Link | | 8B | VLM | X | 17.3 | 38.2 | 20.2 | 44.4 | 15.4 | 34.0 |
| 21 | Gemma3 | Link | | 27B | VLM | X | 16.1 | 44.8 | 19.2 | 50.8 | 14.1 | 40.8 |
| 22 | Qwen-2.5-VL-72B | Link | | 72B | VLM | X | 13.7 | 40.8 | 15.3 | 44.6 | 12.7 | 38.2 |
| 23 | Bagel-Zebra-CoT | Link | | 7B | UM | X | 10.1 | 34.1 | 13.9 | 41.5 | 7.6 | 29.1 |
| 24 | Qwen-2.5-VL-32B | Link | | 32B | VLM | X | 10.0 | 33.7 | 10.6 | 36.9 | 9.6 | 31.5 |
| 25 | GPT-4.1-nano | Link | | | VLM | X | 9.1 | 38.5 | 13.1 | 45.9 | 6.4 | 33.6 |
| 26 | InternVL-3.5-8B-No-Thinking | Link | | 8B | VLM | X | 7.9 | 31.4 | 9.2 | 35.6 | 7.0 | 28.6 |
| 27 | Bagel | Link | | 7B | UM | X | 7.6 | 27.6 | 8.5 | 32.9 | 7.0 | 24.0 |
| 28 | Qwen-2.5-VL-3B | Link | | 3B | VLM | X | 5.3 | 27.5 | 7.9 | 33.4 | 3.6 | 23.6 |
| 29 | GPT-4o | Link | 2024-11-20 | | VLM | X | 4.3 | 30.4 | 5.7 | 34.6 | 3.4 | 27.6 |
Math-VR benchmark (English) on LLMs
| # | Model | Link | #Params | Type | Thinking | Text (PS) | Text (AC) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Deepseek-R1 | Link | 671B | LLM | | 69.9 | 49.5 |

License

This code is released under the MIT License.

Citation

If you find this work helpful, please consider citing our paper:

@article{duan2025codeplot,
  title={CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images},
  author={Duan, Chengqi and Sun, Kaiyue and Fang, Rongyao and Zhang, Manyuan and Feng, Yan and Luo, Ying and Liu, Yufang and Wang, Ke and Pei, Peng and Cai, Xunliang and others},
  journal={arXiv preprint arXiv:2510.11718},
  year={2025}
}

Contact

If you have any questions, please raise an issue or contact us at duancq24@connect.hku.hk.
