EvolvingLMMs-Lab · yiyexy · May 14, 2026
diff --git a/docs/models/index.rst b/docs/models/index.rst
@@ -20,3 +20,5 @@ Documentation for available models and model architectures.
    dllm
    rae_siglip
    sit
+   llava_onevision1_5
+   llava_onevision2
diff --git a/docs/models/llava_onevision2.md b/docs/models/llava_onevision2.md
@@ -0,0 +1,106 @@
+# LLaVA-OneVision2 Training
+
+## Overview
+
+LLaVA-OneVision2 (OV2) is the LMMs-Lab successor to LLaVA-OneVision 1.5. The
+8B-Instruct checkpoint pairs a custom OneVision vision encoder (SigLIP-like
+ViT with 3D RoPE and a patch-merger) with a stock **Qwen3-8B** language
+model. Modeling code is shipped via Hugging Face ``auto_map`` and is loaded
+at runtime through ``trust_remote_code``.
+
+## Supported Features
+
+| Feature | Support |
+|---------|---------|
+| **FSDP2** | ✅ |
+| **FlashAttention 2** | ✅ |
+| **Liger Kernel** | ✅ |
+| **RMPAD (sequence packing)** | ✅ |
+| **Packing** | ✅ |
+| **Ulysses Sequence Parallel** | ✅ (via Qwen3 inner LM) |
+
+## Quick Start
+
+- **Example Config**: [examples/llava_onevision2/example.yaml](../../examples/llava_onevision2/example.yaml)
+- **Run Script**: [examples/llava_onevision2/run.sh](../../examples/llava_onevision2/run.sh)
+
+```bash
+bash examples/llava_onevision2/run.sh
+```
+
+## How Monkey Patching Works
+
+Because OV2's modeling classes are loaded dynamically (no shared import
+path), OV2's own classes are patched per **model instance**. Two patch types
+are registered for ``model_type == "llava_onevision2"``:
+
+* ``"liger"`` – Liger kernels: RoPE, RMSNorm, SwiGLU MLP (inner Qwen3 LM),
+  LayerNorm (OV2 vision encoder), plus a fused linear cross-entropy bound
+  onto OV2's ``ForConditionalGeneration.forward``.
+* ``"rmpad"`` – Sequence-packing (unpadded) attention path: class-level
+  patches to inner Qwen3 attention/decoder/model forwards so they consume
+  ``cu_seq_lens``/``indices``, and an outer ``model_forward`` that wires
+  rmpad metadata through to ``causal_lm_forward``.
+
+The runner appends each one when its ``trainer_args`` flag is set:
+
+| `use_liger_kernel` | `use_rmpad` | Resulting behaviour |
+|---|---|---|
+| ✅ | ✅ | rmpad + fused LCE (historical default) |
+| ✅ | ❌ | fused LCE, no unpadding |
+| ❌ | ✅ | unpadded attention, standard CE |
+| ❌ | ❌ | stock HF forward |
+
+## Key Configuration
+
+```yaml
+model_config:
+  load_from_pretrained_path: lmms-lab-ov2/LLaVA-OneVision2-8B-Instruct
+  attn_implementation: flash_attention_2
+  torch_dtype: bfloat16
+  model_type: llava_onevision2
+  extra_kwargs:
+    trust_remote_code: true        # required: OV2 ships modeling via auto_map
+
+dataset_config:
+  dataset_type: qwen3_vl_iterable
+  processor_config:
+    processor_name: lmms-lab-ov2/LLaVA-OneVision2-8B-Instruct
+    processor_type: llava_onevision2
+  packing: true
+  packing_length: 8192
+
+trainer_args:
+  use_liger_kernel: true
+  use_rmpad: true
+  fsdp2: true
+  fsdp_config:
+    transformer_layer_cls_to_wrap:
+      - Qwen3DecoderLayer            # inner LM (stock Qwen3)
+      - OneVisionEncoderEncoderLayer # OV2 vision tower
+```
+
+## Data Processor
+
+``LlavaOnevision2DataProcessor`` inherits from ``Qwen3_VLDataProcessor``
+and:
+
+1. Uses the OV2 ``AutoProcessor`` (image_processor + video_processor)
+   loaded with ``trust_remote_code=True``.
+2. Rewrites each chat-template ``<vision_start><video_pad><vision_end>``
+   into a sequence of per-frame blocks
+   ``<X.X seconds><vision_start><image_pad>*n<vision_end>`` and aliases the
+   video patch tensors into the image path (every frame becomes a
+   ``[1, H, W]`` row of ``image_grid_thw``).
+3. Computes the block-layout ``patch_positions`` tensor required by the
+   OV2 vision tower's 3D RoPE.
+4. Normalizes per-frame numpy arrays from ``qwen_vl_utils`` (CHW float) to
+   HWC uint8 so OV2's video processor can PIL-convert them.
+
+## Implementation Pointers
+
+* Monkey patches: ``src/lmms_engine/models/llava_onevision2/monkey_patch.py``
+* OV2 forward replacements: ``src/lmms_engine/models/llava_onevision2/llava_onevision2_ops.py``
+* Shared LM loss helper (LCE / CE, rmpad shift, Ulysses SP):
+  ``src/lmms_engine/models/common_ops/loss.py``
+* Data processor: ``src/lmms_engine/datasets/processor/llava_onevision2_processor.py``
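Step 2 of the Data Processor section above (rewriting a video placeholder into timestamped per-frame image blocks) can be illustrated with a small, self-contained sketch. This is not the code in `llava_onevision2_processor.py`. The function names are hypothetical, the token strings follow the doc's simplified spelling rather than the tokenizer's actual special tokens, and the per-frame token count `n` is passed in as a parameter instead of being derived from the grid and patch-merger settings.

```python
# Placeholder spelling copied from the doc text (an assumption, not the real tokens).
VIDEO_BLOCK = "<vision_start><video_pad><vision_end>"


def expand_video_block(text, frame_times, tokens_per_frame):
    """Replace the first video placeholder with one image block per frame.

    frame_times: timestamp of each sampled frame, in seconds.
    tokens_per_frame: number of <image_pad> tokens per frame (the doc's ``n``).
    """
    per_frame_blocks = "".join(
        f"<{t:.1f} seconds><vision_start>{'<image_pad>' * tokens_per_frame}<vision_end>"
        for t in frame_times
    )
    return text.replace(VIDEO_BLOCK, per_frame_blocks, 1)


def video_grid_to_image_rows(num_frames, grid_h, grid_w):
    """Alias a [T, H, W] video grid into T image rows of shape [1, H, W]."""
    return [[1, grid_h, grid_w] for _ in range(num_frames)]


prompt = "Describe the clip: " + VIDEO_BLOCK
expanded = expand_video_block(prompt, frame_times=[0.0, 1.0], tokens_per_frame=2)
rows = video_grid_to_image_rows(num_frames=2, grid_h=4, grid_w=6)
```

With two frames at 0.0 s and 1.0 s, `expanded` contains two timestamped `<vision_start>...<vision_end>` blocks and `rows` holds one `[1, 4, 6]` entry per frame, which is the shape the doc describes for `image_grid_thw`.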
diff --git a/examples/llava_onevision2/example.yaml b/examples/llava_onevision2/example.yaml
@@ -0,0 +1,90 @@
+# LLaVA-OneVision2 (8B-Instruct) training example.
+#
+# This config drives the LMMs-Lab LLaVA-OneVision2 checkpoint, which ships
+# its modeling and processor code via ``auto_map`` (trust_remote_code).
+# The model_config.extra_kwargs.trust_remote_code flag is required so the
+# runner forwards it through AutoConfig / AutoModelFor*ImageTextToText.
+
+trainer_type: fsdp2_trainer
+
+dataset_config:
+  dataset_type: qwen3_vl_iterable
+  dataset_format: yaml
+  datasets:
+    - path: data/LLaVA-Video-178K/llava_video_0_30_s_cap_oe.parquet
+      data_folder: /path/to/LLaVA-Video-178K
+      data_type: parquet
+  processor_config:
+    processor_name: lmms-lab-ov2/LLaVA-OneVision2-8B-Instruct
+    processor_type: llava_onevision2
+    extra_kwargs:
+      image_max_pixels: 360448
+      image_min_pixels: 28800
+      video_max_pixels: 360448
+      video_min_pixels: 28800
+  packing: true
+  packing_strategy: balanced
+  packing_length: 8192
+  shuffle: true
+  filter_overlong: true
+  filter_overlong_workers: 8
+  video_sampling_strategy: fps
+  video_max_pixels: 360448
+  video_max_frames: 64
+  frame_num: 32
+  fps: 1
+  video_backend: qwen_vl_utils
+  extra_kwargs:
+    packing_kwargs:
+      num_buckets: 2
+
+model_config:
+  load_from_pretrained_path: lmms-lab-ov2/LLaVA-OneVision2-8B-Instruct
+  attn_implementation: flash_attention_2
+  torch_dtype: bfloat16
+  model_type: llava_onevision2
+  extra_kwargs:
+    trust_remote_code: true
+
+trainer_args:
+  output_dir: ./output/llava_onevision2_training
+  do_train: true
+  do_eval: false
+  per_device_train_batch_size: 1
+  gradient_accumulation_steps: 1
+  learning_rate: 1.0e-05
+  num_train_epochs: 1
+  max_steps: 1000
+  lr_scheduler_type: cosine
+  warmup_ratio: 0.03
+  logging_steps: 10
+  save_strategy: steps
+  save_steps: 500
+  save_total_limit: 2
+  bf16: true
+  tf32: true
+  dataloader_drop_last: true
+  dataloader_num_workers: 4
+  dataloader_prefetch_factor: 2
+  remove_unused_columns: false
+  gradient_checkpointing: true
+  # Liger kernel + sequence packing (rmpad). The OV2 monkey patch registers
+  # 'liger' and 'rmpad' patch_types independently; the runner stacks them
+  # so the final causal_lm_forward runs with loss_fn=lce + use_rmpad=True.
+  use_liger_kernel: true
+  use_rmpad: true
+  fsdp2: true
+  fsdp_config:
+    transformer_layer_cls_to_wrap:
+      # Inner LM is stock Qwen3; OV2 ships its own vision encoder block.
+      - Qwen3DecoderLayer
+      - OneVisionEncoderEncoderLayer
+    reshard_after_forward: false
+    min_num_params: 0
+  sp_ulysses_degree: 1
+  reduce_dtype: bfloat16
+  output_dtype: bfloat16
+  report_to: none
+  seed: 42
+  optim: adamw_torch_fused
+  run_name: llava_onevision2_training
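The comment on `use_liger_kernel` / `use_rmpad` in the config above says the two patch types stack so that the final forward runs with `loss_fn=lce` and `use_rmpad=True`. A minimal sketch of that mapping follows. The helper names and the append order (`liger` before `rmpad`) are assumptions made for illustration; the runner's actual code may differ.

```python
def select_patch_types(use_liger_kernel, use_rmpad):
    """Return the patch types to apply, one per enabled trainer flag.

    The order (liger first, then rmpad) is an assumption for this sketch.
    """
    patch_types = []
    if use_liger_kernel:
        patch_types.append("liger")
    if use_rmpad:
        patch_types.append("rmpad")
    return patch_types


def resolve_forward_behaviour(patch_types):
    """Summarize the resulting forward: fused LCE vs. standard CE, rmpad on/off."""
    return {
        "loss_fn": "lce" if "liger" in patch_types else "ce",
        "use_rmpad": "rmpad" in patch_types,
    }


# Enumerate the four flag combinations from the docs table.
combos = {
    (liger, rmpad): resolve_forward_behaviour(select_patch_types(liger, rmpad))
    for liger in (True, False)
    for rmpad in (True, False)
}
```

Each entry in `combos` corresponds to one row of the behaviour table in `docs/models/llava_onevision2.md`; with both flags off, no patch type is selected and the stock forward is left untouched.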
diff --git a/examples/llava_onevision2/run.sh b/examples/llava_onevision2/run.sh
@@ -0,0 +1,47 @@
+#!/bin/bash
+
+################################################################################
+# LLaVA-OneVision2 (8B-Instruct) Training with FSDP2
+################################################################################
+#
+# DESCRIPTION:
+#   Train the LMMs-Lab LLaVA-OneVision2 checkpoint with FSDP2, sequence
+#   packing (rmpad), and Liger fused linear cross-entropy.
+#
+# KEY NOTES:
+#   - OV2 ships its modeling + processor code via auto_map. We forward
+#     trust_remote_code through AutoConfig / AutoModelFor*ImageTextToText
+#     so the remote code path is honored. The yaml sets:
+#       model_config.extra_kwargs.trust_remote_code: true
+#   - Inner LM is stock Qwen3, so most liger / rmpad work is delegated to
+#     the qwen3 monkey patch. OV2-specific bits (outer model.forward,
+#     vision LayerNorm, video token expansion) live under
+#     ``src/lmms_engine/models/llava_onevision2``.
+#   - Video frames go through the same image path as multi-image inputs;
+#     the data processor rewrites <video_pad> into per-frame
+#     ``<X.X seconds><vision_start><image_pad>*n<vision_end>`` blocks.
+#
+# REQUIREMENTS:
+#   - 8x GPUs (A100/H100 with 80GB recommended)
+#   - flash-attn: pip install flash-attn --no-build-isolation
+#   - liger-kernel: pip install liger-kernel
+#
+# DATASET:
+#   OpenAI chat format (JSONL/Arrow/Parquet); see docs/user_guide/data_prep.md.
+#
+################################################################################
+
+NGPUS=8
+
+# Auto-accept trust_remote_code prompts triggered by transitive HF auto_*
+# loads (the explicit kwarg we pass should already cover the main path).
+export HF_HUB_TRUST_REMOTE_CODE=1
+export TRUST_REMOTE_CODE=1
+
+torchrun --nproc_per_node=${NGPUS} \
+  --nnodes=1 \
+  --node_rank=0 \
+  --master_addr=127.0.0.1 \
+  --master_port=12356 \
+  -m lmms_engine.launch.cli \
+  config_yaml=examples/llava_onevision2/example.yaml
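For sizing a run, the launcher's `NGPUS=8` combined with the values in `example.yaml` gives an upper bound on tokens per optimizer step. The sketch below works through that arithmetic. It assumes that each per-device batch element is one packed sequence capped at `packing_length`, and that the data-parallel size equals the GPU count divided by `sp_ulysses_degree`; neither assumption is stated in the diff, and the function name is hypothetical.

```python
def max_tokens_per_step(ngpus, sp_degree, per_device_batch, grad_accum, packing_length):
    """Upper bound on tokens consumed per optimizer step under the stated assumptions."""
    if ngpus % sp_degree != 0:
        raise ValueError("ngpus must be divisible by the sequence-parallel degree")
    data_parallel_size = ngpus // sp_degree
    return data_parallel_size * per_device_batch * grad_accum * packing_length


# Values taken from run.sh (NGPUS) and example.yaml (everything else).
per_step = max_tokens_per_step(
    ngpus=8,
    sp_degree=1,          # sp_ulysses_degree
    per_device_batch=1,   # per_device_train_batch_size
    grad_accum=1,         # gradient_accumulation_steps
    packing_length=8192,
)
total_budget = per_step * 1000  # max_steps
```

Under these assumptions the example config processes at most 65,536 tokens per step and at most 65,536,000 tokens over 1000 steps; actual throughput is lower whenever packed sequences are not completely full.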
diff --git a/src/lmms_engine/datasets/processor/__init__.py b/src/lmms_engine/datasets/processor/__init__.py
@@ -2,6 +2,7 @@
 from .bagel_processor import BagelDataProcessor
 from .base_qwen2_5_processor import BaseQwen2_5_DataProcessor
 from .config import ProcessorConfig
+from .llava_onevision2_processor import LlavaOnevision2DataProcessor
 from .llava_processor import LLaVADataProcessor
 from .llava_video_processor import LLaVAVideoDataProcessor
 from .nanovlm_processor import NanovlmDataProcessor
@@ -34,4 +35,5 @@
     "RaeSiglipDataProcessor",
     "SitDataProcessor",
     "Qwen3_VLDataProcessor",
+    "LlavaOnevision2DataProcessor",
 ]