
feat(vision): add Vision DP for parallel ViT computation across Ulysses SP ranks#357

Open
aoshen524 wants to merge 2 commits into alibaba:main from aoshen524:feat/vision-dp-ulysses

Conversation

@aoshen524

Summary

  • Adds Vision Data Parallel (Vision DP) to distribute whole images across Ulysses SP ranks for parallelized ViT computation
  • Sharply reduces ViT memory overhead: without this, every SP rank redundantly processes ALL images through the VisionTransformer. With Vision DP, each rank only processes 1/sp_size of the images, reducing ViT peak memory by ~sp_size× (e.g. SP=4 → ~4× ViT memory reduction)
  • When ulysses_size > 1, each rank processes a subset of images independently, then all-gathers embeddings once at the end
  • Model-agnostic create_dp_vision_forward() wrapper supports any VisionTransformer with a forward(self, hidden_states, grid_thw) signature (see the patch sketch after this list)
  • Supports Qwen2-VL, Qwen2.5-VL, Qwen3-VL, and Qwen3-VL-MoE VisionTransformers
  • Includes GatherVisionEmbeddings custom autograd function with proper gradient scaling for distributed training compatibility
  • Ported from verl (verl-project/verl#5230)
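For concreteness, the patch amounts to swapping each supported ViT class's forward for the wrapped one. A hypothetical sketch of that mechanism (create_dp_vision_forward and apply_vision_dp_patch come from this PR, but the patch mechanics and import paths shown here are assumptions, not the merged code):

```python
# Hypothetical illustration of what apply_vision_dp_patch() does for one
# model family; the exact mechanics in monkey_patch.py may differ.
from transformers.models.qwen2_vl import modeling_qwen2_vl as qwen2_vl

from roll.utils.context_parallel import create_dp_vision_forward

vit_cls = qwen2_vl.Qwen2VisionTransformerPretrainedModel
# forward(self, hidden_states, grid_thw) keeps its signature; only the
# image distribution and final all-gather are added around it.
vit_cls.forward = create_dp_vision_forward(vit_cls.forward)
```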

Why this matters

In VLM RL training with Ulysses SP, the ViT (VisionTransformer) is a major memory bottleneck. Text SP splits the sequence across ranks at each attention layer, but the ViT runs on the full set of images on every rank — meaning ViT memory usage is completely unaffected by SP. For scenarios with many images (e.g. multi-turn GUI agent training with screenshots), ViT activation memory can dominate.

Vision DP solves this by distributing images at the ViT level (the bookkeeping is sketched in code after the comparison below):

  • Before: Each of N SP ranks processes ALL images → ViT memory = O(total_images)
  • After: Each rank processes total_images/N images → ViT memory = O(total_images/N)
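A minimal, CPU-runnable sketch of that bookkeeping. The function names match the test plan below, but the exact signatures and chunking policy in vision_dp.py are assumptions:

```python
import torch

def get_image_patch_counts(grid_thw: torch.Tensor) -> list[int]:
    """Rows of hidden_states contributed by each image. For Qwen2/2.5/3-VL
    vision inputs, grid_thw has shape (num_images, 3) holding (t, h, w) in
    patch units, so each image owns t*h*w flattened patch rows."""
    return (grid_thw[:, 0] * grid_thw[:, 1] * grid_thw[:, 2]).tolist()

def assign_images_to_dp_ranks(num_images: int, dp_size: int) -> list[list[int]]:
    """Contiguous chunks: rank 0 takes the first ceil(num/dp) images, etc.,
    so the all-gathered embeddings come back already in image order."""
    per_rank = (num_images + dp_size - 1) // dp_size
    return [list(range(r * per_rank, min((r + 1) * per_rank, num_images)))
            for r in range(dp_size)]

if __name__ == "__main__":
    grid_thw = torch.tensor([[1, 4, 4], [1, 2, 6], [2, 2, 2]])
    print(get_image_patch_counts(grid_thw))  # [16, 12, 8]
    print(assign_images_to_dp_ranks(3, 2))   # [[0, 1], [2]]
```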

Key design choices

  • Image-level distribution (not patch-level): avoids breaking ViT's internal cu_seqlens tracking
  • Contiguous assignment: rank 0 gets images [0,1,...], rank 1 gets next chunk, etc. — no reordering needed after all-gather
  • Gradient scaling: the backward pass scales gradients by dp_size to compensate for partial image processing before reduction (see the autograd sketch after this list)
  • Qwen3-VL deepstack support: handles tuple return (embeddings, deepstack_embeddings) from Qwen3-VL VisionModel
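The gather step is where the gradient scaling lives. A sketch of a GatherVisionEmbeddings-style autograd function, assuming every rank can compute the per-rank row counts locally from the shared grid_thw and that gradients are later averaged across ranks (the merged implementation may differ, e.g. in its padding strategy):

```python
import torch
import torch.distributed as dist

class GatherVisionEmbeddings(torch.autograd.Function):
    """All-gather variable-length per-rank embeddings along dim 0 (sketch)."""

    @staticmethod
    def forward(ctx, local_emb, sizes, group):
        # sizes: rows produced by each rank, computable on all ranks from
        # grid_thw, so no extra size exchange is needed
        ctx.sizes, ctx.rank = list(sizes), dist.get_rank(group)
        # pad to the max row count so all_gather sees equal shapes
        max_n = max(ctx.sizes)
        padded = local_emb.new_zeros(max_n, *local_emb.shape[1:])
        padded[: local_emb.shape[0]] = local_emb
        bufs = [torch.empty_like(padded) for _ in ctx.sizes]
        dist.all_gather(bufs, padded.contiguous(), group=group)
        # strip padding; chunks arrive in rank order, so no reordering needed
        return torch.cat([b[:n] for b, n in zip(bufs, ctx.sizes)], dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        # keep only this rank's slice of the incoming gradient
        start = sum(ctx.sizes[: ctx.rank])
        grad_local = grad_output[start : start + ctx.sizes[ctx.rank]]
        # scale by dp_size: each rank computed only 1/dp_size of the images,
        # so downstream cross-rank gradient averaging would otherwise shrink
        # ViT gradients by that factor
        return grad_local * len(ctx.sizes), None, None
```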

Files changed

| File | Change |
| --- | --- |
| `roll/utils/context_parallel/vision_dp.py` | NEW — core Vision DP utilities (image assignment, local extraction, all-gather with autograd) |
| `roll/utils/context_parallel/monkey_patch.py` | Add `apply_vision_dp_patch()` / `unapply_vision_dp_patch()` for Qwen2/2.5/3-VL VisionTransformers |
| `roll/utils/context_parallel/__init__.py` | Export the new functions |
| `roll/distributed/strategy/deepspeed_strategy.py` | Call `apply_vision_dp_patch()` in both inference and training workers when `ulysses_size > 1` (sketched below) |
| `tests/utils/test_vision_dp_on_cpu.py` | NEW — 17 unit tests covering the utility functions and integration workflows |
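The strategy hookup itself is just a guard. A sketch of the call site described in the table (apply_vision_dp_patch is from this PR; the surrounding function and parameter names are assumptions):

```python
from roll.utils.context_parallel import apply_vision_dp_patch

def maybe_enable_vision_dp(ulysses_size: int) -> None:
    """Mirrors the guard used in both inference and training workers."""
    if ulysses_size > 1:
        apply_vision_dp_patch()
```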

Test plan

  • 17 unit tests covering get_image_patch_counts, assign_images_to_dp_ranks, and prepare_local_vision_inputs (a sketch follows this list)
  • Integration tests for full workflow with varying image sizes
  • Edge cases: empty inputs, fewer images than ranks, single rank
  • Multi-GPU integration test with actual VLM model
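The CPU-only unit tests look roughly like this. The import path is from this PR, but the helper signatures and edge-case semantics follow the sketch above and may differ from the real tests in tests/utils/test_vision_dp_on_cpu.py:

```python
# Sketch of CPU-only unit tests for the contiguous assignment helper.
from roll.utils.context_parallel.vision_dp import assign_images_to_dp_ranks

def test_contiguous_assignment_covers_all_images():
    chunks = assign_images_to_dp_ranks(num_images=10, dp_size=4)
    flat = [i for chunk in chunks for i in chunk]
    assert flat == list(range(10))  # contiguous, no reordering after gather

def test_fewer_images_than_ranks():
    chunks = assign_images_to_dp_ranks(num_images=2, dp_size=4)
    assert sum(len(c) for c in chunks) == 2  # some ranks simply get no images
```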

🤖 Generated with Claude Code

feat(vision): add Vision DP for parallel ViT computation across Ulysses SP ranks

Distribute whole images across Ulysses SP ranks for parallelized ViT computation,
reducing ViT peak memory by ~sp_size x (e.g. SP=4 -> ~4x ViT memory reduction).

Key changes:
- Add roll/utils/context_parallel/vision_dp.py with image distribution utilities,
  GatherVisionEmbeddings autograd function, and model-agnostic VisionTransformer wrapper
- Add apply_vision_dp_patch() in monkey_patch.py for Qwen2-VL, Qwen2.5-VL, Qwen3-VL,
  Qwen3-VL-MoE VisionTransformer classes
- Integrate into DeepSpeed strategy (both inference and training workers)
- Add 17 unit tests covering all utility functions, edge cases, and integration workflows

Ported from verl (verl-project/verl#5230).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

CLAassistant commented Feb 16, 2026

CLA assistant check
All committers have signed the CLA.

Integrate upstream hf_flash_attention_patch for transformers>=4.53.0
alongside Vision DP patches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
