
Capflow Tools

📖 Introduction

Capflow is a multi-domain caption pipeline built on vLLM. It decomposes visual captioning into three subtasks: perception, reasoning, and summary, and merges their outputs into a unified caption to get the most out of open-source models. Capflow achieves near GPT-4.1-level performance on natural image description, causal reasoning, and video description tasks. The pipeline is highly extensible: you can add or optimize agent workflows and prompts for specific domains as needed, making it suitable for specialized and fine-grained scenarios.

🚀 Step0: Data Preprocessing

1. Sub-Annotation Format

Each sub-dataset annotation file should follow the structure below:

[
  {
    "id": 0,
    "image": "ADE_train_00002626.jpg",
    "conversations": [
      { "from": "system", "value": "SYSTEM PROMPT" },
      { "from": "human",  "value": "<image>\n QUESTION PROMPT" },
      { "from": "gpt",    "value": "GPT Response" }
    ],
    "width": 1024,
    "height": 1024
  }
]
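
As a sanity check before captioning, a small validator like the following (illustrative only, not a repo utility) can confirm each entry has the expected fields and turn order:

```python
def validate_entry(entry):
    """Check that one sub-annotation entry has the expected fields."""
    required = {"id", "image", "conversations", "width", "height"}
    missing = required - entry.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    roles = [turn["from"] for turn in entry["conversations"]]
    if roles != ["system", "human", "gpt"]:
        raise ValueError(f"unexpected turn order: {roles}")
    return True

entry = {
    "id": 0,
    "image": "ADE_train_00002626.jpg",
    "conversations": [
        {"from": "system", "value": "SYSTEM PROMPT"},
        {"from": "human", "value": "<image>\n QUESTION PROMPT"},
        {"from": "gpt", "value": "GPT Response"},
    ],
    "width": 1024,
    "height": 1024,
}
assert validate_entry(entry)
```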

2. MetaFile Format

You need a meta file that integrates all sub-annotation files. We provide an example Metafile in MetaCaptioner/data/Metafile.json. The keys are dataset names, and the values contain related information:

{
  "ADE20K": {
    "root": "/Prefix/Path/to/image/or/video",
    "annotation": "/Path/To/Annotation/File/ADE20k.json",
    "length": 100000,
    "task": "Natural",
    "sample_count": 42522,
    "modality": "image"
  }
}

Descriptions:

  • root: Prefix path to the visual input; joined with the relative image path from the annotation file to form the full path
  • annotation: JSON file storing original annotations and image paths
  • length: Total dataset size
  • sample_count: Number of samples to be used
  • task: Workflow setting at the dataset level (if the data itself contains a task field, it takes priority)
  • modality: Data modality (e.g., image, video)
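
For illustration, this is how the root prefix and an image path from the annotation file combine into a full path (full_image_path is a hypothetical helper, not a repo function; paths are the placeholder values from the example above):

```python
import os

# Metafile entry from the example above (placeholder paths).
meta = {
    "ADE20K": {
        "root": "/Prefix/Path/to/image/or/video",
        "annotation": "/Path/To/Annotation/File/ADE20k.json",
        "modality": "image",
    }
}

def full_image_path(meta, dataset, image_rel):
    """Join the dataset root with the relative image path from its annotation."""
    return os.path.join(meta[dataset]["root"], image_rel)

print(full_image_path(meta, "ADE20K", "ADE_train_00002626.jpg"))
# -> /Prefix/Path/to/image/or/video/ADE_train_00002626.jpg
```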

3. Sharding Strategy

For multi-node inference, it is recommended to pre-define each node's workload so that the load stays balanced. We provide an example split_plan.jsonl file in the MetaCaptioner/data folder. Each line of the JSONL sharding plan has the following format:

{
  "data_name": "dataset",
  "json_url": "dataset.json",
  "root": "/path/for/prefix/path",
  "task": "Natural",
  "modality": "image",
  "total": 1227,
  "processed": 1056,
  "remain": 171,
  "shard_index": 0,
  "num_shards": 1,
  "start_idx": 1056,
  "end_idx": 1227,
  "anno_file_key": "vqa_rad_en_20240402.json",
  "assign_rank": 4
}

Additional Parameter Descriptions:

  • task: Must be chosen from the nine predefined Capflow categories
  • total: Total data length
  • processed: Number of processed samples
  • remain: Remaining samples
  • shard_index: Shard number (for splitting a large dataset across multiple ranks)
  • num_shards: Total number of shards
  • start_idx, end_idx: Start and end indices for sampling
  • anno_file_key: Annotation file name
  • assign_rank: Assigned rank number for distributed inference

Pre-assigning workloads to ranks improves cluster inference efficiency and avoids the problems of oversized datasets and uneven data distribution.
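
The start_idx/end_idx boundaries above could be generated with a small helper along these lines (make_shards is a hypothetical sketch; the repo's split_plan.jsonl may be produced differently):

```python
def make_shards(total, processed, num_shards):
    """Split the remaining [processed, total) index range into contiguous,
    near-equal shards, mirroring the start_idx/end_idx fields above."""
    remain = total - processed
    base, extra = divmod(remain, num_shards)
    shards, start = [], processed
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        shards.append({"shard_index": i, "num_shards": num_shards,
                       "start_idx": start, "end_idx": start + size})
        start += size
    return shards

# One shard covering the 171 remaining samples of the example entry.
print(make_shards(1227, 1056, 1))
```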

4. Domain Definition

Capflow supports building different workflows for different data domains, either by predefined dataset-level workflows or by fine-grained data-level workflows.

A. Predefined Domain for Dataset

This method is suitable for known data sources; you can directly specify the task type in the split plan or metafile.

B. Domain Router Assignment

This method is suitable for more fine-grained data description. We provide inferdomain.py, a script based on Qwen2.5-VL, which assigns a domain to each visual input:

bash script/run_domain.sh

The script performs workflow routing for each data item and stores the routing information in the output folder. You can modify the script to fit your specific paths. After processing, every sample has a domain assignment and a confidence score; you can use all the data directly or filter out low-confidence samples before captioning.
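
A minimal confidence filter might look like this (the "domain" and "confidence" field names are assumptions; match them to the keys inferdomain.py actually writes):

```python
def filter_by_confidence(records, min_conf=0.8):
    """Keep routing records whose confidence meets the threshold.

    Field names ("domain", "confidence") are illustrative; adjust them
    to the actual keys in the router's output files.
    """
    return [r for r in records if r.get("confidence", 0.0) >= min_conf]

records = [
    {"image": "a.jpg", "domain": "Natural Scenes", "confidence": 0.95},
    {"image": "b.jpg", "domain": "UI & Interaction", "confidence": 0.42},
]
kept = filter_by_confidence(records, min_conf=0.8)
print(len(kept))  # 1
```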

- List of Predefined Domains

We have listed some representative domains for your reference.

| Domain | Subdomain |
| --- | --- |
| Natural Scenes | Object, Social Activity, Animal and Plants, Remote Sensing |
| Structure & Math | Chart, Table, Equation, Geometry, Diagram |
| Infographic & Document | Natural OCR, Document, Poster, Table Docs |
| Medical & Bio-Imaging | Radiology, Pathology, Clinical, Case Report |
| UI & Interaction | Website, Mobile, Tablet |
| Code & Programming | Code OCR, Code Understanding, Web Code |
| Knowledge & Education | Science Knowledge, Art, Culture, Natural Biology, Celebrity |
| Synthetic & Aesthetic | Text2Image, Aesthetic |
| Video & Temporal | Human Activity, Education, Movies, Egocentric, Reasoning, Sports |

🛠️ Usage

Capflow utilizes vLLM for model parallelism and predefined data parallelism to enable batch inference. On the data side, datasets are sharded and assigned to specific ranks for balanced workloads and flexible checkpointing. On the model side, vLLM’s batch engine enables model parallelism for various model sizes.

Once the data is ready, you can follow the steps below for automatic captioning and filtering:

Step 1: Image/Video Captioning

Use the following script to run distributed automated image/video captioning:

bash ./script/run_caption.sh

This runs captionpipeline.py for batch inference. System prompts for the different domains are defined in Prompt.py; you can tune the agent workflow via system_prompt_map_dict in Prompt.py to suit your specific needs.
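
As a rough sketch of what such a map could look like (the keys and prompt texts below are invented; the real system_prompt_map_dict lives in Prompt.py):

```python
# Hypothetical shape of a domain-to-prompt map like system_prompt_map_dict
# in Prompt.py; the actual keys and prompts live in that file.
system_prompt_map = {
    "Natural": (
        "You are a visual captioner. First describe every salient object "
        "(perception), then explain relationships and causes (reasoning), "
        "then merge both into one fluent caption (summary)."
    ),
    "Video": (
        "You are a video captioner. Describe key frames and actions over "
        "time, reason about the event order, then summarize the clip."
    ),
}

def get_system_prompt(task, default_task="Natural"):
    """Fall back to a default domain when a task has no dedicated prompt."""
    return system_prompt_map.get(task, system_prompt_map[default_task])

assert get_system_prompt("Video") is system_prompt_map["Video"]
assert get_system_prompt("Unknown") is system_prompt_map["Natural"]
```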

Step 2: Quality Filtering

Use the following script to run automated quality scoring:

bash ./script/run_filter.sh

This script automatically detects the Rank*_node*_*.jsonl annotation files in the output folder, divides them across ranks, and runs batch inference.
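
The file-matching step can be reproduced with a standard glob (the directory layout is simulated here purely for illustration):

```python
import glob
import os
import tempfile

# Simulate the output-folder layout that the filter script scans.
out = tempfile.mkdtemp()
for name in ["Rank0_node0_ADE20K.jsonl", "Rank1_node0_ADE20K.jsonl", "notes.txt"]:
    open(os.path.join(out, name), "w").close()

# Only the per-rank shard files match the Rank*_node*_*.jsonl pattern.
shards = sorted(os.path.basename(p)
                for p in glob.glob(os.path.join(out, "Rank*_node*_*.jsonl")))
print(shards)  # ['Rank0_node0_ADE20K.jsonl', 'Rank1_node0_ADE20K.jsonl']
```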

Finally, use the following command to filter the data by score:

python dataprocess/filterd_by_score.py --data-folder ./Output/Caption --min-overall 3 --min-dim-score 3
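
A filter rule equivalent to those flags might be sketched as follows (the "overall" and "dim_scores" field names are assumptions; check the keys filterd_by_score.py actually reads):

```python
def keep_sample(record, min_overall=3, min_dim_score=3):
    """Keep a caption whose overall score and every per-dimension score
    meet the thresholds (field names are assumed, see above)."""
    if record.get("overall", 0) < min_overall:
        return False
    return all(v >= min_dim_score for v in record.get("dim_scores", {}).values())

rec = {"overall": 4, "dim_scores": {"accuracy": 4, "detail": 3}}
print(keep_sample(rec))  # True
```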

Comparison with GPT-4.1

We use Qwen2.5-VL-72B to build Capflow and compare it with GPT-4.1 on various visual understanding and reasoning benchmarks to evaluate caption quality. We estimate inference cost from the official API prices for GPT-4.1 and Qwen2.5-VL-72B over 100 images.

| Model | MMMU | MMVet | MathVerse | MathVista | ChartQA | InfoVQA | AI2D | VideoMME | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4.1 | 55.7 | 61.7 | 56.8 | 65.0 | 62.3 | 63.2 | 75.5 | 26.8 | $1.47 |
| Capflow with Qwen2.5-VL-72B | 55.1 | 57.8 | 53.1 | 62.5 | 59.2 | 50.2 | 74.2 | 27.6 | $0.14 |

Caption Speed

The approximate maximum inference speed was evaluated on a single A800 or H200 GPU using Qwen2.5-VL-72B as the inference agent.

| GPUs | Max Speed (Image/hour) | Min Speed (Video/hour) |
| --- | --- | --- |
| A800 80G | 1427 | 128 |
| H200 140G | 2542 | 245 |

Contribution

We plan to further upgrade and optimize this annotation pipeline in the future, and release it as a toolkit. Feel free to open issues or submit pull requests if you have suggestions or improvements!