
feat(vision): add Vision DP for parallel ViT computation across Ulysses SP ranks#357

Open
aoshen524 wants to merge 2 commits into alibaba:main from aoshen524:feat/vision-dp-ulysses

Conversation

@aoshen524

Summary

  • Adds Vision Data Parallel (Vision DP) to distribute whole images across Ulysses SP ranks for parallelized ViT computation
  • Sharply reduces ViT memory overhead: without this, every SP rank redundantly processes ALL images through the VisionTransformer. With Vision DP, each rank only processes 1/sp_size of the images, reducing ViT peak memory by ~sp_size× (e.g. SP=4 → ~4× ViT memory reduction)
  • When ulysses_size > 1, each rank processes a subset of images independently, then all-gathers embeddings once at the end
  • Model-agnostic create_dp_vision_forward() wrapper supports any VisionTransformer with a forward(self, hidden_states, grid_thw) signature (see the patch sketch after this list)
  • Supports Qwen2-VL, Qwen2.5-VL, Qwen3-VL, and Qwen3-VL-MoE VisionTransformers
  • Includes GatherVisionEmbeddings custom autograd function with proper gradient scaling for distributed training compatibility
  • Ported from verl (verl-project/verl#5230)
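For concreteness, the patch amounts to swapping each supported ViT class's forward for the wrapped one. A hypothetical sketch of that mechanism (create_dp_vision_forward and apply_vision_dp_patch come from this PR, but the patch mechanics and import paths shown here are assumptions, not the merged code):

```python
# Hypothetical illustration of what apply_vision_dp_patch() does for one
# model family; the exact mechanics in monkey_patch.py may differ.
from transformers.models.qwen2_vl import modeling_qwen2_vl as qwen2_vl

from roll.utils.context_parallel import create_dp_vision_forward

vit_cls = qwen2_vl.Qwen2VisionTransformerPretrainedModel
# forward(self, hidden_states, grid_thw) keeps its signature; only the
# image distribution and final all-gather are added around it.
vit_cls.forward = create_dp_vision_forward(vit_cls.forward)
```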

Why this matters

In VLM RL training with Ulysses SP, the ViT (VisionTransformer) is a major memory bottleneck. Text SP splits the sequence across ranks at each attention layer, but the ViT runs on the full set of images on every rank — meaning ViT memory usage is completely unaffected by SP. For scenarios with many images (e.g. multi-turn GUI agent training with screenshots), ViT activation memory can dominate.

Vision DP solves this by distributing images at the ViT level (the bookkeeping is sketched in code after the comparison below):

  • Before: Each of N SP ranks processes ALL images → ViT memory = O(total_images)
  • After: Each rank processes total_images/N images → ViT memory = O(total_images/N)
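A minimal, CPU-runnable sketch of that bookkeeping. The function names match the test plan below, but the exact signatures and chunking policy in vision_dp.py are assumptions:

```python
import torch

def get_image_patch_counts(grid_thw: torch.Tensor) -> list[int]:
    """Rows of hidden_states contributed by each image. For Qwen2/2.5/3-VL
    vision inputs, grid_thw has shape (num_images, 3) holding (t, h, w) in
    patch units, so each image owns t*h*w flattened patch rows."""
    return (grid_thw[:, 0] * grid_thw[:, 1] * grid_thw[:, 2]).tolist()

def assign_images_to_dp_ranks(num_images: int, dp_size: int) -> list[list[int]]:
    """Contiguous chunks: rank 0 takes the first ceil(num/dp) images, etc.,
    so the all-gathered embeddings come back already in image order."""
    per_rank = (num_images + dp_size - 1) // dp_size
    return [list(range(r * per_rank, min((r + 1) * per_rank, num_images)))
            for r in range(dp_size)]

if __name__ == "__main__":
    grid_thw = torch.tensor([[1, 4, 4], [1, 2, 6], [2, 2, 2]])
    print(get_image_patch_counts(grid_thw))  # [16, 12, 8]
    print(assign_images_to_dp_ranks(3, 2))   # [[0, 1], [2]]
```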

Key design choices

  • Image-level distribution (not patch-level): avoids breaking ViT's internal cu_seqlens tracking
  • Contiguous assignment: rank 0 gets images [0,1,...], rank 1 gets next chunk, etc. — no reordering needed after all-gather
  • Gradient scaling: the backward pass scales gradients by dp_size to compensate for partial image processing before reduction (see the autograd sketch after this list)
  • Qwen3-VL deepstack support: handles tuple return (embeddings, deepstack_embeddings) from Qwen3-VL VisionModel
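The gather step is where the gradient scaling lives. A sketch of a GatherVisionEmbeddings-style autograd function, assuming every rank can compute the per-rank row counts locally from the shared grid_thw and that gradients are later averaged across ranks (the merged implementation may differ, e.g. in its padding strategy):

```python
import torch
import torch.distributed as dist

class GatherVisionEmbeddings(torch.autograd.Function):
    """All-gather variable-length per-rank embeddings along dim 0 (sketch)."""

    @staticmethod
    def forward(ctx, local_emb, sizes, group):
        # sizes: rows produced by each rank, computable on all ranks from
        # grid_thw, so no extra size exchange is needed
        ctx.sizes, ctx.rank = list(sizes), dist.get_rank(group)
        # pad to the max row count so all_gather sees equal shapes
        max_n = max(ctx.sizes)
        padded = local_emb.new_zeros(max_n, *local_emb.shape[1:])
        padded[: local_emb.shape[0]] = local_emb
        bufs = [torch.empty_like(padded) for _ in ctx.sizes]
        dist.all_gather(bufs, padded.contiguous(), group=group)
        # strip padding; chunks arrive in rank order, so no reordering needed
        return torch.cat([b[:n] for b, n in zip(bufs, ctx.sizes)], dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        # keep only this rank's slice of the incoming gradient
        start = sum(ctx.sizes[: ctx.rank])
        grad_local = grad_output[start : start + ctx.sizes[ctx.rank]]
        # scale by dp_size: each rank computed only 1/dp_size of the images,
        # so downstream cross-rank gradient averaging would otherwise shrink
        # ViT gradients by that factor
        return grad_local * len(ctx.sizes), None, None
```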

Files changed

| File | Change |
| --- | --- |
| `roll/utils/context_parallel/vision_dp.py` | NEW — core Vision DP utilities (image assignment, local extraction, all-gather with autograd) |
| `roll/utils/context_parallel/monkey_patch.py` | Add `apply_vision_dp_patch()` / `unapply_vision_dp_patch()` for Qwen2/2.5/3-VL VisionTransformers |
| `roll/utils/context_parallel/__init__.py` | Export the new functions |
| `roll/distributed/strategy/deepspeed_strategy.py` | Call `apply_vision_dp_patch()` in both inference and training workers when `ulysses_size > 1` (sketched below) |
| `tests/utils/test_vision_dp_on_cpu.py` | NEW — 17 unit tests covering the utility functions and integration workflows |
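The strategy hookup itself is just a guard. A sketch of the call site described in the table (apply_vision_dp_patch is from this PR; the surrounding function and parameter names are assumptions):

```python
from roll.utils.context_parallel import apply_vision_dp_patch

def maybe_enable_vision_dp(ulysses_size: int) -> None:
    """Mirrors the guard used in both inference and training workers."""
    if ulysses_size > 1:
        apply_vision_dp_patch()
```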

Test plan

  • 17 unit tests covering get_image_patch_counts, assign_images_to_dp_ranks, and prepare_local_vision_inputs (a sketch follows this list)
  • Integration tests for full workflow with varying image sizes
  • Edge cases: empty inputs, fewer images than ranks, single rank
  • Multi-GPU integration test with actual VLM model
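The CPU-only unit tests look roughly like this. The import path is from this PR, but the helper signatures and edge-case semantics follow the sketch above and may differ from the real tests in tests/utils/test_vision_dp_on_cpu.py:

```python
# Sketch of CPU-only unit tests for the contiguous assignment helper.
from roll.utils.context_parallel.vision_dp import assign_images_to_dp_ranks

def test_contiguous_assignment_covers_all_images():
    chunks = assign_images_to_dp_ranks(num_images=10, dp_size=4)
    flat = [i for chunk in chunks for i in chunk]
    assert flat == list(range(10))  # contiguous, no reordering after gather

def test_fewer_images_than_ranks():
    chunks = assign_images_to_dp_ranks(num_images=2, dp_size=4)
    assert sum(len(c) for c in chunks) == 2  # some ranks simply get no images
```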

🤖 Generated with Claude Code

feat(vision): add Vision DP for parallel ViT computation across Ulysses SP ranks

Distribute whole images across Ulysses SP ranks for parallelized ViT computation,
reducing ViT peak memory by ~sp_size x (e.g. SP=4 -> ~4x ViT memory reduction).

Key changes:
- Add roll/utils/context_parallel/vision_dp.py with image distribution utilities,
  GatherVisionEmbeddings autograd function, and model-agnostic VisionTransformer wrapper
- Add apply_vision_dp_patch() in monkey_patch.py for Qwen2-VL, Qwen2.5-VL, Qwen3-VL,
  Qwen3-VL-MoE VisionTransformer classes
- Integrate into DeepSpeed strategy (both inference and training workers)
- Add 17 unit tests covering all utility functions, edge cases, and integration workflows

Ported from verl (verl-project/verl#5230).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

CLAassistant commented Feb 16, 2026

CLA assistant check
All committers have signed the CLA.

Integrate upstream hf_flash_attention_patch for transformers>=4.53.0
alongside Vision DP patches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
