Capflow is a multi-domain caption pipeline built on vLLM. It decomposes visual captioning into three subtasks: perception, reasoning, and summary, and merges their outputs into a unified caption to maximize the potential of open-source models. Capflow achieves near-GPT-4.1-level performance on natural image description, causal reasoning, and video description tasks. The pipeline is highly extensible: you can flexibly add or optimize agent workflows and prompts for specific domains as needed, making it suitable for specialized and fine-grained scenarios.
Each sub-dataset annotation file should follow the structure below:
```json
[
  {
    "id": 0,
    "image": "ADE_train_00002626.jpg",
    "conversations": [
      { "from": "system", "value": "SYSTEM PROMPT" },
      { "from": "human", "value": "<image>\n QUESTION PROMPT" },
      { "from": "gpt", "value": "GPT Response" }
    ],
    "width": 1024,
    "height": 1024
  }
]
```

You also need a meta file that integrates all sub-annotation files. We provide an example metafile in MetaCaptioner/data/Metafile.json. The keys are dataset names, and the values contain the related information:
```json
{
  "ADE20K": {
    "root": "/Prefix/Path/to/image/or/video",
    "annotation": "/Path/To/Annotation/File/ADE20k.json",
    "length": 100000,
    "task": "Natural",
    "sample_count": 42522,
    "modality": "image"
  }
}
```

Descriptions:
- `root`: Prefix path to the visual input; combined with the suffix in the annotation file to form the full path
- `annotation`: JSON file storing the original annotations and image paths
- `length`: Total dataset size
- `sample_count`: Number of samples to be used
- `task`: Workflow setting at the dataset level (if the data itself contains a `task` field, it takes priority)
- `modality`: Data modality (e.g., image, video)
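As a concrete illustration, loading this metafile in Python might look like the sketch below. Only the keys come from the format above; the helper name and the example path join are our own assumptions:

```python
import json
import os

def load_metafile(metafile_path):
    """Yield one (name, info) pair per sub-dataset.

    Hypothetical helper: only the metafile keys follow the
    documented format above.
    """
    with open(metafile_path, "r", encoding="utf-8") as f:
        meta = json.load(f)
    yield from meta.items()

for name, info in load_metafile("MetaCaptioner/data/Metafile.json"):
    # `root` is the prefix joined with each sample's image/video
    # suffix from the annotation file to form the full path.
    example_path = os.path.join(info["root"], "ADE_train_00002626.jpg")
    print(name, info["task"], info["modality"], info["sample_count"], example_path)
```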
For multi-node inference, it is recommended to pre-define each node's workload to ensure balanced inference. We provide an example split_plan.jsonl file in the MetaCaptioner/data folder. Each line of the JSONL sharding plan has the following format:
```json
{
  "data_name": "dataset",
  "json_url": "dataset.json",
  "root": "/path/for/prefix/path",
  "task": "Natural",
  "modality": "image",
  "total": 1227,
  "processed": 1056,
  "remain": 171,
  "shard_index": 0,
  "num_shards": 1,
  "start_idx": 1056,
  "end_idx": 1227,
  "anno_file_key": "vqa_rad_en_20240402.json",
  "assign_rank": 4
}
```

Additional parameter descriptions:
- `task`: Must be one of the nine predefined Capflow categories
- `total`: Total data length
- `processed`: Number of samples already processed
- `remain`: Number of remaining samples
- `shard_index`: Shard number (for splitting a large dataset across multiple ranks)
- `num_shards`: Total number of shards
- `start_idx`, `end_idx`: Start and end indices for sampling
- `anno_file_key`: Annotation file name
- `assign_rank`: Rank assigned for distributed inference
Pre-assigning workloads to different ranks improves cluster inference efficiency and effectively resolves the problems of oversized datasets and uneven data distribution.
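For reference, the bookkeeping behind such a plan can be reproduced in a few lines of Python. The sketch below is hypothetical (the even split and round-robin rank assignment are our assumptions), but it yields the same `start_idx`/`end_idx` as the example entry above:

```python
import json

def make_split_plan(entry, num_shards, num_ranks):
    """Split the unprocessed tail [processed, total) of one dataset
    into contiguous shards. Field names follow the JSONL format above;
    the even split and round-robin rank policy are assumptions."""
    remain = entry["total"] - entry["processed"]
    per_shard = -(-remain // num_shards)  # ceiling division
    for shard_index in range(num_shards):
        start = entry["processed"] + shard_index * per_shard
        end = min(start + per_shard, entry["total"])
        if start >= end:
            break
        yield {**entry,
               "remain": remain,
               "shard_index": shard_index,
               "num_shards": num_shards,
               "start_idx": start,
               "end_idx": end,
               "assign_rank": shard_index % num_ranks}

entry = {"data_name": "dataset", "json_url": "dataset.json",
         "root": "/path/for/prefix/path", "task": "Natural",
         "modality": "image", "total": 1227, "processed": 1056,
         "anno_file_key": "vqa_rad_en_20240402.json"}
with open("split_plan.jsonl", "w", encoding="utf-8") as f:
    for shard in make_split_plan(entry, num_shards=1, num_ranks=8):
        f.write(json.dumps(shard) + "\n")  # start_idx=1056, end_idx=1227
```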
Capflow supports building different workflows for different data domains, either by predefined dataset-level workflows or by fine-grained data-level workflows.
Dataset-level workflows are suitable for known data sources: simply specify the task type directly in the split plan or metafile.
Data-level workflows are suitable for more fine-grained data description. We provide inferdomain.py, a script based on Qwen2.5-VL that assigns a domain to each visual input:
```bash
bash script/run_domain.sh
```

The script performs workflow routing for each data item and stores the routing information in the output folder. You can modify the script to fit your specific paths. After processing, every data item will have a domain assignment and a confidence score. You can use all of the data directly, or filter out low-confidence items before captioning.
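The exact schema of the routing output is defined by the script, but a minimal confidence filter could look like the sketch below. The folder path and the `domain`/`confidence` field names are assumptions; adjust them to the actual output:

```python
import glob
import json

CONF_THRESHOLD = 0.8  # illustrative cutoff

kept, dropped = [], []
for path in glob.glob("./Output/Domain/*.jsonl"):  # assumed output location
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            # Keep items whose routed domain was assigned confidently.
            if item.get("confidence", 0.0) >= CONF_THRESHOLD:
                kept.append(item)
            else:
                dropped.append(item)

print(f"kept {len(kept)} items, dropped {len(dropped)} low-confidence items")
```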
We have listed some representative domains for your reference.
| Domain | Subdomain |
|---|---|
| Natural | Scenes Object, Social Activity, Animal and Plants, Remote Sensing |
| Structure & Math | Chart, Table, Equation, Geometry, Diagram |
| Infographic & Document | Natural OCR, Document, Poster, Table Docs |
| Medical & Bio-Imaging | Radiology, Pathology, Clinical, Case Report |
| UI & Interaction | Website, Mobile, Tablet |
| Code & Programming | Code OCR, Code Understanding, Web Code |
| Knowledge & Education | Science Knowledge, Art, Culture, Natural Biology, Celebrity |
| Synthetic & Aesthetic | Text2Image, Aesthetic |
| Video & Temporal | Human Activity, Education, Movies, Egocentric, Reasoning, Sports |
Capflow utilizes vLLM for model parallelism and predefined data parallelism to enable batch inference. On the data side, datasets are sharded and assigned to specific ranks for balanced workloads and flexible checkpointing. On the model side, vLLM’s batch engine enables model parallelism for various model sizes.
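Concretely, the model side boils down to one vLLM engine per node, sharded across that node's GPUs with tensor parallelism and fed with the rank's slice of the split plan. The snippet below is a minimal sketch, not Capflow's exact configuration; the model name and sampling settings are illustrative:

```python
from vllm import LLM, SamplingParams

# One engine per node; tensor parallelism shards the 72B weights
# across the node's GPUs.
llm = LLM(model="Qwen/Qwen2.5-VL-72B-Instruct", tensor_parallel_size=4)
sampling_params = SamplingParams(temperature=0.2, max_tokens=1024)

# In Capflow, `prompts` would be built from this rank's shard of the
# split plan (samples start_idx..end_idx of the assigned dataset).
prompts = ["Describe the image in detail."]
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```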
Once the data is ready, you can follow the steps below for automatic captioning and filtering:
Use the following script to run distributed automated image/video captioning:
```bash
bash ./script/run_caption.sh
```

This runs captionpipeline.py for batch inference. System prompts for the different domains live in Prompt.py; you can tailor the agent workflow to your needs by editing system_prompt_map_dict in Prompt.py.
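For instance, overriding the workflow prompt for a single domain might look like the sketch below. We assume here that system_prompt_map_dict maps a domain name to a system-prompt string; check Prompt.py for the actual value structure before editing:

```python
from Prompt import system_prompt_map_dict

# Assumption: values are plain system-prompt strings. The prompt text
# below is illustrative and mirrors the perception -> reasoning ->
# summary decomposition described earlier.
system_prompt_map_dict["Medical & Bio-Imaging"] = (
    "You are a radiology captioner. First describe the visible anatomy "
    "and findings (perception), then analyze their clinical significance "
    "(reasoning), and finally merge both into one fluent caption (summary)."
)
```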
Use the following script to run automated quality scoring:
```bash
bash ./script/run_filter.sh
```

This script automatically detects the Rank*_node*_*.jsonl annotation files in the output folder, divides them across ranks, and performs batch inference.
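The file detection itself is easy to reproduce; the sketch below mirrors the Rank*_node*_*.jsonl pattern, with the folder path and round-robin assignment being our assumptions:

```python
import glob

num_ranks = 4  # illustrative
files = sorted(glob.glob("./Output/Caption/Rank*_node*_*.jsonl"))  # assumed folder
# Deal the detected shards out across scorer ranks, round-robin.
assignments = {rank: files[rank::num_ranks] for rank in range(num_ranks)}
for rank, shard in assignments.items():
    print(f"rank {rank}: {len(shard)} files")
```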
Finally, use the following command to filter the data by score:
```bash
python dataprocess/filterd_by_score.py --data-folder ./Output/Caption --min-overall 3 --min-dim-score 3
```

We use Qwen2.5-VL-72B to build Capflow and compare it against GPT-4.1 on a range of visual understanding and reasoning benchmarks to evaluate captioning ability. Inference cost is estimated from the official API prices of GPT-4.1 and Qwen2.5-VL-72B over 100 images.
| Model | MMMU | MMVet | MathVerse | MathVista | ChartQA | InfoVQA | AI2D | VideoMME | Cost |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 55.7 | 61.7 | 56.8 | 65.0 | 62.3 | 63.2 | 75.5 | 26.8 | $1.47 |
| Capflow with Qwen2.5-VL-72B | 55.1 | 57.8 | 53.1 | 62.5 | 59.2 | 50.2 | 74.2 | 27.6 | $0.14 |
The approximate maximum inference speed was evaluated on a single A800 or H200 GPU using Qwen2.5-VL-72B as the inference agent.
| GPUs | Max Speed (images/hour) | Min Speed (videos/hour) |
|---|---|---|
| A800 80G | 1427 | 128 |
| H200 140G | 2542 | 245 |
We plan to further upgrade and optimize this annotation pipeline in the future, and release it as a toolkit. Feel free to open issues or submit pull requests if you have suggestions or improvements!