Capflow is a multi-domain caption pipeline built on vLLM. It decomposes visual captioning into three subtasks: perception, reasoning, and summary, and merges their outputs into a unified caption to maximize the potential of open-source models. Capflow achieves near-GPT-4.1-level performance on natural image description, causal reasoning, and video description tasks. The pipeline is highly extensible: you can flexibly add or optimize agent workflows and prompts for specific domains as needed, making it suitable for specialized and fine-grained scenarios.
Each sub-dataset annotation file should follow the structure below:
```json
[
  {
    "id": 0,
    "image": "ADE_train_00002626.jpg",
    "conversations": [
      { "from": "system", "value": "SYSTEM PROMPT" },
      { "from": "human", "value": "<image>\n QUESTION PROMPT" },
      { "from": "gpt", "value": "GPT Response" }
    ],
    "width": 1024,
    "height": 1024
  }
]
```

You also need a meta file that integrates all sub-annotation files. We provide an example metafile in MetaCaptioner/data/Metafile.json. The keys are dataset names, and the values contain the related information:
```json
{
  "ADE20K": {
    "root": "/Prefix/Path/to/image/or/video",
    "annotation": "/Path/To/Annotation/File/ADE20k.json",
    "length": 100000,
    "task": "Natural",
    "sample_count": 42522,
    "modality": "image"
  }
}
```

Descriptions:
- `root`: Prefix path to the visual input; combined with the suffix in the annotation file to form the full path
- `annotation`: JSON file storing the original annotations and image paths
- `length`: Total dataset size
- `sample_count`: Number of samples to be used
- `task`: Workflow setting at the dataset level (if the data itself contains a `task` field, it takes priority)
- `modality`: Data modality (e.g., image, video)
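As a concrete illustration, loading this metafile in Python might look like the sketch below. Only the keys come from the format above; the helper name and the example path join are our own assumptions:

```python
import json
import os

def load_metafile(metafile_path):
    """Yield one (name, info) pair per sub-dataset.

    Hypothetical helper: only the metafile keys follow the
    documented format above.
    """
    with open(metafile_path, "r", encoding="utf-8") as f:
        meta = json.load(f)
    yield from meta.items()

for name, info in load_metafile("MetaCaptioner/data/Metafile.json"):
    # `root` is the prefix joined with each sample's image/video
    # suffix from the annotation file to form the full path.
    example_path = os.path.join(info["root"], "ADE_train_00002626.jpg")
    print(name, info["task"], info["modality"], info["sample_count"], example_path)
```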
For multi-node inference, it is recommended to pre-define each node's workload to ensure balanced inference. We provide an example split_plan.jsonl file in the MetaCaptioner/data folder. Each line of the JSONL sharding plan has the following format:
```json
{
  "data_name": "dataset",
  "json_url": "dataset.json",
  "root": "/path/for/prefix/path",
  "task": "Natural",
  "modality": "image",
  "total": 1227,
  "processed": 1056,
  "remain": 171,
  "shard_index": 0,
  "num_shards": 1,
  "start_idx": 1056,
  "end_idx": 1227,
  "anno_file_key": "vqa_rad_en_20240402.json",
  "assign_rank": 4
}
```

Additional parameter descriptions:
- `task`: Must be one of the nine predefined Capflow categories
- `total`: Total data length
- `processed`: Number of samples already processed
- `remain`: Number of remaining samples
- `shard_index`: Shard number (for splitting a large dataset across multiple ranks)
- `num_shards`: Total number of shards
- `start_idx`, `end_idx`: Start and end indices for sampling
- `anno_file_key`: Annotation file name
- `assign_rank`: Rank assigned for distributed inference
Pre-assigning workloads to different ranks improves cluster inference efficiency and effectively resolves the problems of oversized datasets and uneven data distribution.
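For reference, the bookkeeping behind such a plan can be reproduced in a few lines of Python. The sketch below is hypothetical (the even split and round-robin rank assignment are our assumptions), but it yields the same `start_idx`/`end_idx` as the example entry above:

```python
import json

def make_split_plan(entry, num_shards, num_ranks):
    """Split the unprocessed tail [processed, total) of one dataset
    into contiguous shards. Field names follow the JSONL format above;
    the even split and round-robin rank policy are assumptions."""
    remain = entry["total"] - entry["processed"]
    per_shard = -(-remain // num_shards)  # ceiling division
    for shard_index in range(num_shards):
        start = entry["processed"] + shard_index * per_shard
        end = min(start + per_shard, entry["total"])
        if start >= end:
            break
        yield {**entry,
               "remain": remain,
               "shard_index": shard_index,
               "num_shards": num_shards,
               "start_idx": start,
               "end_idx": end,
               "assign_rank": shard_index % num_ranks}

entry = {"data_name": "dataset", "json_url": "dataset.json",
         "root": "/path/for/prefix/path", "task": "Natural",
         "modality": "image", "total": 1227, "processed": 1056,
         "anno_file_key": "vqa_rad_en_20240402.json"}
with open("split_plan.jsonl", "w", encoding="utf-8") as f:
    for shard in make_split_plan(entry, num_shards=1, num_ranks=8):
        f.write(json.dumps(shard) + "\n")  # start_idx=1056, end_idx=1227
```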
Capflow supports building different workflows for different data domains, either by predefined dataset-level workflows or by fine-grained data-level workflows.
Dataset-level workflows are suitable for known data sources: simply specify the task type directly in the split plan or metafile.
Data-level workflows are suitable for more fine-grained data description. We provide inferdomain.py, a script based on Qwen2.5-VL that assigns a domain to each visual input:
```bash
bash script/run_domain.sh
```

The script performs workflow routing for each data item and stores the routing information in the output folder. You can modify the script to fit your specific paths. After processing, every data item will have a domain assignment and a confidence score. You can use all of the data directly, or filter out low-confidence items before captioning.
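The exact schema of the routing output is defined by the script, but a minimal confidence filter could look like the sketch below. The folder path and the `domain`/`confidence` field names are assumptions; adjust them to the actual output:

```python
import glob
import json

CONF_THRESHOLD = 0.8  # illustrative cutoff

kept, dropped = [], []
for path in glob.glob("./Output/Domain/*.jsonl"):  # assumed output location
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            # Keep items whose routed domain was assigned confidently.
            if item.get("confidence", 0.0) >= CONF_THRESHOLD:
                kept.append(item)
            else:
                dropped.append(item)

print(f"kept {len(kept)} items, dropped {len(dropped)} low-confidence items")
```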
We have listed some representative domains for your reference.
| Domain | Subdomain |
|---|---|
| Natural | Scenes Object, Social Activity, Animal and Plants, Remote Sensing |
| Structure & Math | Chart, Table, Equation, Geometry, Diagram |
| Infographic & Document | Natural OCR, Document, Poster, Table Docs |
| Medical & Bio-Imaging | Radiology, Pathology, Clinical, Case Report |
| UI & Interaction | Website, Mobile, Tablet |
| Code & Programming | Code OCR, Code Understanding, Web Code |
| Knowledge & Education | Science Knowledge, Art, Culture, Natural Biology, Celebrity |
| Synthetic & Aesthetic | Text2Image, Aesthetic |
| Video & Temporal | Human Activity, Education, Movies, Egocentric, Reasoning, Sports |
Capflow utilizes vLLM for model parallelism and predefined data parallelism to enable batch inference. On the data side, datasets are sharded and assigned to specific ranks for balanced workloads and flexible checkpointing. On the model side, vLLM’s batch engine enables model parallelism for various model sizes.
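Concretely, the model side boils down to one vLLM engine per node, sharded across that node's GPUs with tensor parallelism and fed with the rank's slice of the split plan. The snippet below is a minimal sketch, not Capflow's exact configuration; the model name and sampling settings are illustrative:

```python
from vllm import LLM, SamplingParams

# One engine per node; tensor parallelism shards the 72B weights
# across the node's GPUs.
llm = LLM(model="Qwen/Qwen2.5-VL-72B-Instruct", tensor_parallel_size=4)
sampling_params = SamplingParams(temperature=0.2, max_tokens=1024)

# In Capflow, `prompts` would be built from this rank's shard of the
# split plan (samples start_idx..end_idx of the assigned dataset).
prompts = ["Describe the image in detail."]
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```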
Once the data is ready, you can follow the steps below for automatic captioning and filtering:
Use the following script to run distributed automated image/video captioning:
```bash
bash ./script/run_caption.sh
```

This runs captionpipeline.py for batch inference. System prompts for the different domains live in Prompt.py; you can tailor the agent workflow to your needs by editing system_prompt_map_dict in Prompt.py.
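For instance, overriding the workflow prompt for a single domain might look like the sketch below. We assume here that system_prompt_map_dict maps a domain name to a system-prompt string; check Prompt.py for the actual value structure before editing:

```python
from Prompt import system_prompt_map_dict

# Assumption: values are plain system-prompt strings. The prompt text
# below is illustrative and mirrors the perception -> reasoning ->
# summary decomposition described earlier.
system_prompt_map_dict["Medical & Bio-Imaging"] = (
    "You are a radiology captioner. First describe the visible anatomy "
    "and findings (perception), then analyze their clinical significance "
    "(reasoning), and finally merge both into one fluent caption (summary)."
)
```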
Use the following script to run automated quality scoring:
```bash
bash ./script/run_filter.sh
```

This script automatically detects the Rank*_node*_*.jsonl annotation files in the output folder, divides them across ranks, and performs batch inference.
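The file detection itself is easy to reproduce; the sketch below mirrors the Rank*_node*_*.jsonl pattern, with the folder path and round-robin assignment being our assumptions:

```python
import glob

num_ranks = 4  # illustrative
files = sorted(glob.glob("./Output/Caption/Rank*_node*_*.jsonl"))  # assumed folder
# Deal the detected shards out across scorer ranks, round-robin.
assignments = {rank: files[rank::num_ranks] for rank in range(num_ranks)}
for rank, shard in assignments.items():
    print(f"rank {rank}: {len(shard)} files")
```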
Finally, use the following command to filter the data by score:
```bash
python dataprocess/filterd_by_score.py --data-folder ./Output/Caption --min-overall 3 --min-dim-score 3
```

We use Qwen2.5-VL-72B to build Capflow and compare it against GPT-4.1 on a range of visual understanding and reasoning benchmarks to evaluate captioning ability. Inference cost is estimated from the official API prices of GPT-4.1 and Qwen2.5-VL-72B over 100 images.
| Model | MMMU | MMVet | MathVerse | MathVista | ChartQA | InfoVQA | AI2D | VideoMME | Cost |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 55.7 | 61.7 | 56.8 | 65.0 | 62.3 | 63.2 | 75.5 | 26.8 | $1.47 |
| Capflow with Qwen2.5-VL-72B | 55.1 | 57.8 | 53.1 | 62.5 | 59.2 | 50.2 | 74.2 | 27.6 | $0.14 |
The approximate maximum inference speed was evaluated on a single A800 or H200 GPU using Qwen2.5-VL-72B as the inference agent.
| GPUs | Max Speed (images/hour) | Min Speed (videos/hour) |
|---|---|---|
| A800 80G | 1427 | 128 |
| H200 140G | 2542 | 245 |
We plan to further upgrade and optimize this annotation pipeline in the future, and release it as a toolkit. Feel free to open issues or submit pull requests if you have suggestions or improvements!