2 changes: 2 additions & 0 deletions docs/models/index.rst
@@ -20,3 +20,5 @@ Documentation for available models and model architectures.
dllm
rae_siglip
sit
llava_onevision1_5
llava_onevision2
106 changes: 106 additions & 0 deletions docs/models/llava_onevision2.md
@@ -0,0 +1,106 @@
# LLaVA-OneVision2 Training

## Overview

LLaVA-OneVision2 (OV2) is the LMMs-Lab successor to LLaVA-OneVision 1.5. The
8B-Instruct checkpoint pairs a custom OneVision vision encoder (SigLIP-like
ViT with 3D RoPE and a patch-merger) with a stock **Qwen3-8B** language
model. Modeling code is shipped via Hugging Face ``auto_map`` and is loaded
at runtime through ``trust_remote_code``.

## Supported Features

| Feature | Support |
|---------|---------|
| **FSDP2** | ✅ |
| **FlashAttention 2** | ✅ |
| **Liger Kernel** | ✅ |
| **RMPAD (sequence packing)** | ✅ |
| **Packing** | ✅ |
| **Ulysses Sequence Parallel** | ✅ (via Qwen3 inner LM) |

## Quick Start

- **Example Config**: [examples/llava_onevision2/example.yaml](../../examples/llava_onevision2/example.yaml)
- **Run Script**: [examples/llava_onevision2/run.sh](../../examples/llava_onevision2/run.sh)

```bash
bash examples/llava_onevision2/run.sh
```

## How Monkey Patching Works

Because OV2's modeling classes are loaded dynamically (no shared import
path), patches are applied at the **model instance** level. Two patch_types
are registered for ``model_type == "llava_onevision2"``:

* ``"liger"`` – Liger kernels: RoPE, RMSNorm, SwiGLU MLP (inner Qwen3 LM),
LayerNorm (OV2 vision encoder), plus a fused linear cross-entropy bound
onto OV2's ``ForConditionalGeneration.forward``.
* ``"rmpad"`` – Sequence-packing (unpadded) attention path: class-level
patches to inner Qwen3 attention/decoder/model forwards so they consume
``cu_seq_lens``/``indices``, and an outer ``model_forward`` that wires
rmpad metadata through to ``causal_lm_forward``.

The runner appends the matching patch types in order, depending on which
``trainer_args`` flags are set:

| `use_liger_kernel` | `use_rmpad` | Resulting behaviour |
|---|---|---|
| ✅ | ✅ | rmpad + fused LCE (historical default) |
| ✅ | ❌ | fused LCE, no unpadding |
| ❌ | ✅ | unpadded attention, standard CE |
| ❌ | ❌ | stock HF forward |
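
The flag-to-patch mapping in the table above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the helper name ``select_patch_types`` is hypothetical, and the ``"liger"``-before-``"rmpad"`` ordering is an assumption taken from the order in which the two patch types are listed on this page.

```python
from typing import List


def select_patch_types(use_liger_kernel: bool, use_rmpad: bool) -> List[str]:
    """Return the patch_types to apply for the given trainer_args flags.

    Hypothetical helper: the real runner may build this list differently.
    """
    patch_types: List[str] = []
    if use_liger_kernel:
        # Liger kernels plus fused linear cross-entropy.
        patch_types.append("liger")
    if use_rmpad:
        # Unpadded (sequence-packing) attention path.
        patch_types.append("rmpad")
    return patch_types


# Both flags on reproduces the "rmpad + fused LCE" row of the table;
# both flags off leaves the stock HF forward untouched (empty list).
both_on = select_patch_types(use_liger_kernel=True, use_rmpad=True)
both_off = select_patch_types(use_liger_kernel=False, use_rmpad=False)
```

Keeping the two patch types independent means each row of the table corresponds to one subset of the list, so no combined "liger + rmpad" patch type is needed.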

## Key Configuration

```yaml
model_config:
  load_from_pretrained_path: lmms-lab-ov2/LLaVA-OneVision2-8B-Instruct
  attn_implementation: flash_attention_2
  torch_dtype: bfloat16
  model_type: llava_onevision2
  extra_kwargs:
    trust_remote_code: true  # required: OV2 ships modeling via auto_map

dataset_config:
  dataset_type: qwen3_vl_iterable
  processor_config:
    processor_name: lmms-lab-ov2/LLaVA-OneVision2-8B-Instruct
    processor_type: llava_onevision2
  packing: true
  packing_length: 8192

trainer_args:
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap:
      - Qwen3DecoderLayer             # inner LM (stock Qwen3)
      - OneVisionEncoderEncoderLayer  # OV2 vision tower
```
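
As a rough illustration of why ``extra_kwargs.trust_remote_code`` matters, the sketch below shows one way a runner could turn ``model_config`` into keyword arguments for a Hugging Face ``from_pretrained`` call. The helper name ``build_pretrained_kwargs`` and the merge logic are assumptions made for this example; the actual forwarding code in the runner may look different.

```python
from typing import Any, Dict


def build_pretrained_kwargs(model_config: Dict[str, Any]) -> Dict[str, Any]:
    """Collect from_pretrained keyword arguments out of a model_config dict.

    Hypothetical helper: it only mirrors the keys shown in the YAML above.
    """
    kwargs: Dict[str, Any] = {}
    for key in ("attn_implementation", "torch_dtype"):
        if model_config.get(key) is not None:
            kwargs[key] = model_config[key]
    # extra_kwargs are forwarded verbatim; this is where trust_remote_code lives.
    kwargs.update(model_config.get("extra_kwargs") or {})
    if not kwargs.get("trust_remote_code"):
        # OV2 ships its modeling code via auto_map, so loading needs this flag.
        raise ValueError("llava_onevision2 requires extra_kwargs.trust_remote_code: true")
    return kwargs


example_config = {
    "load_from_pretrained_path": "lmms-lab-ov2/LLaVA-OneVision2-8B-Instruct",
    "attn_implementation": "flash_attention_2",
    "torch_dtype": "bfloat16",
    "model_type": "llava_onevision2",
    "extra_kwargs": {"trust_remote_code": True},
}
example_kwargs = build_pretrained_kwargs(example_config)
```

The resulting dictionary would then be splatted into the loader, for example ``AutoConfig.from_pretrained(path, **example_kwargs)``; failing early on a missing flag avoids a less obvious error when the remote modeling code cannot be resolved.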

## Data Processor

``LlavaOnevision2DataProcessor`` inherits from ``Qwen3_VLDataProcessor``
and:

1. Uses the OV2 ``AutoProcessor`` (image_processor + video_processor)
loaded with ``trust_remote_code=True``.
2. Rewrites each chat-template ``<vision_start><video_pad><vision_end>``
into a sequence of per-frame blocks
``<X.X seconds><vision_start><image_pad>*n<vision_end>`` and aliases the
video patch tensors into the image path (every frame becomes a
``[1, H, W]`` row of ``image_grid_thw``).
3. Computes the block-layout ``patch_positions`` tensor required by the
OV2 vision tower's 3D RoPE.
4. Normalizes per-frame numpy arrays from ``qwen_vl_utils`` (CHW float) to
HWC uint8 so OV2's video processor can PIL-convert them.
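
Step 2 above can be sketched in plain Python. This is an illustration under stated assumptions, not the processor's actual code: the function names are hypothetical, the token strings use the spellings from this page (the real tokenizer strings may differ), and the per-frame timestamps and token counts are passed in rather than derived from the video.

```python
from typing import List, Sequence

# Token spellings follow this page; the real tokenizer strings may differ.
VISION_START = "<vision_start>"
VISION_END = "<vision_end>"
VIDEO_PAD = "<video_pad>"
IMAGE_PAD = "<image_pad>"


def expand_video_placeholder(
    text: str,
    timestamps: Sequence[float],
    tokens_per_frame: Sequence[int],
) -> str:
    """Rewrite the first video placeholder into per-frame image blocks."""
    if len(timestamps) != len(tokens_per_frame):
        raise ValueError("need one token count per frame timestamp")
    blocks = "".join(
        f"<{t:.1f} seconds>{VISION_START}{IMAGE_PAD * n}{VISION_END}"
        for t, n in zip(timestamps, tokens_per_frame)
    )
    video_block = f"{VISION_START}{VIDEO_PAD}{VISION_END}"
    # Only the first occurrence is replaced; call again for further videos.
    return text.replace(video_block, blocks, 1)


def video_grid_to_image_rows(t: int, h: int, w: int) -> List[List[int]]:
    """Alias a video grid of t frames into t image rows of shape [1, h, w]."""
    return [[1, h, w] for _ in range(t)]


prompt = "Describe: <vision_start><video_pad><vision_end> please"
expanded = expand_video_placeholder(prompt, timestamps=[0.0, 1.0], tokens_per_frame=[2, 2])
rows = video_grid_to_image_rows(t=2, h=4, w=6)
```

Because every frame becomes its own ``[1, H, W]`` row, the vision tower sees a video as a sequence of single-frame images, which is what lets video inputs share the image path.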

## Implementation Pointers

* Monkey patches: ``src/lmms_engine/models/llava_onevision2/monkey_patch.py``
* OV2 forward replacements: ``src/lmms_engine/models/llava_onevision2/llava_onevision2_ops.py``
* Shared LM loss helper (LCE / CE, rmpad shift, Ulysses SP):
``src/lmms_engine/models/common_ops/loss.py``
* Data processor: ``src/lmms_engine/datasets/processor/llava_onevision2_processor.py``
90 changes: 90 additions & 0 deletions examples/llava_onevision2/example.yaml
@@ -0,0 +1,90 @@
# LLaVA-OneVision2 (8B-Instruct) training example.
#
# This config drives the LMMs-Lab LLaVA-OneVision2 checkpoint, which ships
# its modeling and processor code via ``auto_map`` (trust_remote_code).
# The model_config.extra_kwargs.trust_remote_code flag is required so the
# runner forwards it through AutoConfig / AutoModelFor*ImageTextToText.

trainer_type: fsdp2_trainer

dataset_config:
  dataset_type: qwen3_vl_iterable
  dataset_format: yaml
  datasets:
    - path: data/LLaVA-Video-178K/llava_video_0_30_s_cap_oe.parquet
      data_folder: /path/to/LLaVA-Video-178K
      data_type: parquet
  processor_config:
    processor_name: lmms-lab-ov2/LLaVA-OneVision2-8B-Instruct
    processor_type: llava_onevision2
    extra_kwargs:
      image_max_pixels: 360448
      image_min_pixels: 28800
      video_max_pixels: 360448
      video_min_pixels: 28800
  packing: true
  packing_strategy: balanced
  packing_length: 8192
  shuffle: true
  filter_overlong: true
  filter_overlong_workers: 8
  video_sampling_strategy: fps
  video_max_pixels: 360448
  video_max_frames: 64
  frame_num: 32
  fps: 1
  video_backend: qwen_vl_utils
  extra_kwargs:
    packing_kwargs:
      num_buckets: 2

model_config:
  load_from_pretrained_path: lmms-lab-ov2/LLaVA-OneVision2-8B-Instruct
  attn_implementation: flash_attention_2
  torch_dtype: bfloat16
  model_type: llava_onevision2
  extra_kwargs:
    trust_remote_code: true

trainer_args:
  output_dir: ./output/llava_onevision2_training
  do_train: true
  do_eval: false
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 1
  learning_rate: 1.0e-05
  num_train_epochs: 1
  max_steps: 1000
  lr_scheduler_type: cosine
  warmup_ratio: 0.03
  logging_steps: 10
  save_strategy: steps
  save_steps: 500
  save_total_limit: 2
  bf16: true
  tf32: true
  dataloader_drop_last: true
  dataloader_num_workers: 4
  dataloader_prefetch_factor: 2
  remove_unused_columns: false
  gradient_checkpointing: true
  # Liger kernel + sequence packing (rmpad). The OV2 monkey patch registers
  # 'liger' and 'rmpad' patch_types independently; the runner stacks them
  # so the final causal_lm_forward runs with loss_fn=lce + use_rmpad=True.
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap:
      # Inner LM is stock Qwen3; OV2 ships its own vision encoder block.
      - Qwen3DecoderLayer
      - OneVisionEncoderEncoderLayer
    reshard_after_forward: false
    min_num_params: 0
  sp_ulysses_degree: 1
  reduce_dtype: bfloat16
  output_dtype: bfloat16
  report_to: none
  seed: 42
  optim: adamw_torch_fused
  run_name: llava_onevision2_training
47 changes: 47 additions & 0 deletions examples/llava_onevision2/run.sh
@@ -0,0 +1,47 @@
#!/bin/bash

################################################################################
# LLaVA-OneVision2 (8B-Instruct) Training with FSDP2
################################################################################
#
# DESCRIPTION:
# Train the LMMs-Lab LLaVA-OneVision2 checkpoint with FSDP2, sequence
# packing (rmpad), and Liger fused linear cross-entropy.
#
# KEY NOTES:
# - OV2 ships its modeling + processor code via auto_map. We forward
# trust_remote_code through AutoConfig / AutoModelFor*ImageTextToText
# so the remote code path is honored. The yaml sets:
# model_config.extra_kwargs.trust_remote_code: true
# - Inner LM is stock Qwen3, so most liger / rmpad work is delegated to
# the qwen3 monkey patch. OV2-specific bits (outer model.forward,
# vision LayerNorm, video token expansion) live under
# ``src/lmms_engine/models/llava_onevision2``.
# - Video frames go through the same image path as multi-image inputs;
# the data processor rewrites <video_pad> into per-frame
# ``<X.X seconds><vision_start><image_pad>*n<vision_end>`` blocks.
#
# REQUIREMENTS:
# - 8x GPUs (A100/H100 with 80GB recommended)
# - flash-attn: pip install flash-attn --no-build-isolation
# - liger-kernel: pip install liger-kernel
#
# DATASET:
# OpenAI chat format (JSONL/Arrow/Parquet); see docs/user_guide/data_prep.md.
#
################################################################################

NGPUS=8

# Auto-accept trust_remote_code prompts triggered by transitive HF auto_*
# loads (the explicit kwarg we pass should already cover the main path).
export HF_HUB_TRUST_REMOTE_CODE=1
export TRUST_REMOTE_CODE=1

torchrun --nproc_per_node=${NGPUS} \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=127.0.0.1 \
    --master_port=12356 \
    -m lmms_engine.launch.cli \
    config_yaml=examples/llava_onevision2/example.yaml
2 changes: 2 additions & 0 deletions src/lmms_engine/datasets/processor/__init__.py
@@ -2,6 +2,7 @@
from .bagel_processor import BagelDataProcessor
from .base_qwen2_5_processor import BaseQwen2_5_DataProcessor
from .config import ProcessorConfig
from .llava_onevision2_processor import LlavaOnevision2DataProcessor
from .llava_processor import LLaVADataProcessor
from .llava_video_processor import LLaVAVideoDataProcessor
from .nanovlm_processor import NanovlmDataProcessor
@@ -34,4 +35,5 @@
"RaeSiglipDataProcessor",
"SitDataProcessor",
"Qwen3_VLDataProcessor",
"LlavaOnevision2DataProcessor",
]