[feat] add LLaVA-OneVision2 (8B-Instruct) training support#170
Merged
AutoConfig.from_pretrained was called without trust_remote_code so any checkpoint that registers its config via auto_map raised before model class resolution. Even after fixing that, the resolved config class is not in any HF model mapping, so add an auto_map fallback that picks the AutoModelFor* class declared by the checkpoint.
When a model_type registers only a subset of patch_types, looking up the missing entry in the inner dict raised KeyError. Log and return instead so multi-patch invocations can probe different models without raising.
OV2 ships its modeling code via auto_map / trust_remote_code, so all
patches are applied at the model instance level. Inner LM is stock
Qwen3, so the qwen3 class-level liger/rmpad patches do most of the work;
the OV2-specific bits are:
- llava_onevision2/monkey_patch.py registers two patch_types:
  - liger: bind causal_lm_forward with loss_fn=lce + module-level liger
    swaps (rms_norm/swiglu in Qwen3 layers, layer_norm in OV2 vision)
  - rmpad: bind OV2 model_forward + rebind causal_lm_forward with
    use_rmpad=True, preserving liger's loss_fn if already bound
- llava_onevision2/llava_onevision2_ops.py: replacement forwards for the
outer LlavaOnevision2Model and ForConditionalGeneration; loss is
delegated to the shared compute_lm_loss helper
- common_ops/loss.py: shared next-token loss with optional fused LCE,
rmpad seq-wise shifting, and Ulysses SP gather
- datasets/processor/llava_onevision2_processor.py: inherits
Qwen3_VLDataProcessor; rewrites the chat template's
<vision_start><video_pad><vision_end> into per-frame
<X.X seconds><vision_start><image_pad>*n<vision_end> blocks and
aliases video tensors into the image path. Includes a normalizer for
CHW float frames from qwen_vl_utils so they hit OV2 video processor's
list[np.ndarray] branch.
- runner.py: auto-append 'rmpad' patch_type when trainer_args.use_rmpad
- models/utils.py: route llava_onevision2 flops through the qwen2 path
using text_config (its LM is Qwen3)
Adds examples/llava_onevision2/{example.yaml,run.sh} and
docs/models/llava_onevision2.md describing the auto_map / trust_remote_code
loading flow, the split liger / rmpad monkey-patch composition, and the
data processor's video-as-multi-image rewrite. Also registers the new doc
in the models index (and adds the missing llava_onevision1_5 entry that
was already present as a file).
yiyexy approved these changes on May 14, 2026
Motivation
Add training support for the LMMs-Lab LLaVA-OneVision2 8B-Instruct
checkpoint. OV2 ships its modeling and processor code via
`auto_map` (trust_remote_code) and pairs a custom OneVision vision
encoder with a stock Qwen3-8B language model, so most of the
machinery (liger, rmpad, fused LCE) can be reused from the existing
qwen3 plumbing; the OV2-specific work is mostly glue.
Modifications
Model loading
- Passes `trust_remote_code` to `AutoConfig.from_pretrained` and falls
back to the AutoModelFor* class declared in the checkpoint's
`auto_map` when the config class is not registered in any HF
model mapping (the case for all trust_remote_code checkpoints).
through.
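
The `auto_map` fallback described above can be pictured with a minimal sketch. It works on a plain dict shaped like the `auto_map` entry of a checkpoint's `config.json`; the function name `pick_auto_class_name` and the preference order are assumptions made for illustration and are not taken from this PR.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical preference order; the PR does not state which classes it checks.
PREFERRED_AUTO_CLASSES = (
    "AutoModelForCausalLM",
    "AutoModelForImageTextToText",
    "AutoModel",
)


def pick_auto_class_name(
    auto_map: Optional[Mapping[str, str]],
    preferred: Sequence[str] = PREFERRED_AUTO_CLASSES,
) -> Optional[str]:
    """Return the name of an AutoModel* class declared in auto_map, or None."""
    if not auto_map:
        return None
    # First try the preferred classes in order.
    for name in preferred:
        if name in auto_map:
            return name
    # Otherwise accept any AutoModel* entry the checkpoint declares.
    for name in auto_map:
        if name.startswith("AutoModel"):
            return name
    return None


example_auto_map = {
    "AutoConfig": "configuration_x.XConfig",
    "AutoModelForCausalLM": "modeling_x.XForCausalLM",
}
print(pick_auto_class_name(example_auto_map))
```

In a real loader the returned name would presumably be resolved with `getattr(transformers, name)` and the model loaded through that class's `from_pretrained(..., trust_remote_code=True)`; that step is left out here so the sketch stays dependency-free.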
Monkey patch infra
- Logs and returns for patch_types that are not registered for a given
model_type instead of raising `KeyError`. This lets us auto-append new
patch_types from the runner without breaking existing models.
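
A rough sketch of the "log and return" lookup, assuming a nested `model_type -> patch_type -> function` registry. `PATCH_REGISTRY`, `register_patch`, `apply_patch`, and `resolve_patch_types` are hypothetical names chosen for this example; the actual infra in the repo may be organized differently.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)

# Hypothetical layout: model_type -> patch_type -> patch function.
PATCH_REGISTRY: Dict[str, Dict[str, Callable[[object], None]]] = {}


def register_patch(model_type: str, patch_type: str):
    """Decorator that records a patch function for (model_type, patch_type)."""
    def decorator(fn):
        PATCH_REGISTRY.setdefault(model_type, {})[patch_type] = fn
        return fn
    return decorator


def apply_patch(model: object, model_type: str, patch_type: str) -> bool:
    """Apply a patch if one is registered; log and return False otherwise."""
    patch_fn = PATCH_REGISTRY.get(model_type, {}).get(patch_type)
    if patch_fn is None:
        logger.info("no %r patch registered for %r, skipping", patch_type, model_type)
        return False
    patch_fn(model)
    return True


def resolve_patch_types(patch_types: List[str], use_rmpad: bool) -> List[str]:
    """Append 'rmpad' when use_rmpad is set and it is not already requested."""
    resolved = list(patch_types)
    if use_rmpad and "rmpad" not in resolved:
        resolved.append("rmpad")
    return resolved
```

With this shape, a runner can request `["liger", "rmpad"]` for every model and models that only register `liger` simply skip the second entry.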
OV2 monkey patches (instance-only, since auto_map)
- `liger`: module-level liger swaps (rms_norm/swiglu in Qwen3 layers,
layer_norm on OV2 vision encoder); binds OV2 `causal_lm_forward` with
`loss_fn="lce"`, `use_rmpad=False`.
- `rmpad`: rmpad patches, OV2 outer `model_forward` binding, and a rebind
of `causal_lm_forward` with `use_rmpad=True` that detects the
liger-bound `loss_fn` (if any) to preserve fused LCE.
- `runner.py`: auto-appends the `rmpad` patch_type when
`trainer_args.use_rmpad=True`.
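
One way to picture the liger / rmpad composition is instance-level binding with `functools.partial`, where the rmpad step inspects the options that an earlier bind stored. The toy `causal_lm_forward`, the `ToyModel` class, and the default of `"ce"` when liger was not applied are all assumptions for illustration; the PR's real forwards take model inputs and compute a loss.

```python
import functools


def causal_lm_forward(self, tokens, loss_fn="ce", use_rmpad=False):
    """Stand-in forward that only reports which options it was bound with."""
    return {"loss_fn": loss_fn, "use_rmpad": use_rmpad, "n": len(tokens)}


class ToyModel:
    def forward(self, tokens):
        return {"loss_fn": "ce", "use_rmpad": False, "n": len(tokens)}


def bind_forward(model, **options):
    # Instance-level patch: only this object is affected, not its class.
    model.forward = functools.partial(causal_lm_forward, model, **options)


def apply_liger(model):
    bind_forward(model, loss_fn="lce", use_rmpad=False)


def apply_rmpad(model):
    # A partial exposes its stored keyword options via .keywords; a plain
    # bound method has no such attribute, so we fall back to an empty dict.
    keywords = getattr(model.forward, "keywords", None) or {}
    loss_fn = keywords.get("loss_fn", "ce")  # assumed default without liger
    bind_forward(model, loss_fn=loss_fn, use_rmpad=True)


model = ToyModel()
apply_liger(model)
apply_rmpad(model)
print(model.forward([1, 2, 3]))
```

Because the rebind reads the previous options instead of hard-coding them, applying `rmpad` after `liger` keeps the fused-LCE choice, while applying `rmpad` alone leaves the plain loss in place.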
Shared LM loss helper
- `common_ops/loss.py`: reusable next-token loss path (LCE / CE,
optional rmpad seq-wise shifting, Ulysses SP gather). OV2 uses it;
future RFCs that split liger/rmpad on other models can reuse it too.
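
The "seq-wise shifting" under rmpad can be illustrated with a small pure-Python sketch: when several sequences are packed into one flat row, next-token labels must be shifted inside each sequence so the last token of one sample never predicts the first token of the next. The function name `shift_labels_packed` and the `cu_seqlens` argument (cumulative sequence boundaries) are assumptions for this example; the helper in this PR operates on tensors and may differ in detail.

```python
from typing import List

IGNORE_INDEX = -100  # conventional ignore label for cross-entropy in PyTorch


def shift_labels_packed(labels: List[int], cu_seqlens: List[int]) -> List[int]:
    """Shift labels left by one within each packed sequence.

    cu_seqlens holds cumulative boundaries, e.g. [0, 3, 5] for two
    sequences of length 3 and 2 packed into one row of 5 tokens.
    """
    shifted = [IGNORE_INDEX] * len(labels)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        if end - start < 2:
            continue  # nothing to predict in empty or single-token sequences
        # Position i is trained to predict token i + 1 of the same sequence;
        # the last position of each sequence keeps IGNORE_INDEX.
        shifted[start:end - 1] = labels[start + 1:end]
    return shifted


print(shift_labels_packed([1, 2, 3, 4, 5], [0, 3, 5]))
# -> [2, 3, -100, 5, -100]
```

A naive global shift would instead pair token `3` with label `4`, leaking a target across the sample boundary.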
Data processor
- `datasets/processor/llava_onevision2_processor.py`: inherits
`Qwen3_VLDataProcessor`. Loads the OV2 `AutoProcessor` with
`trust_remote_code=True`, rewrites
`<vision_start><video_pad><vision_end>` into per-frame
`<X.X seconds><vision_start><image_pad>*n<vision_end>` blocks,
aliases video tensors into the image path (expands
`video_grid_thw[T,H,W]` into T rows of `[1,H,W]`), builds
block-layout `patch_positions`, and normalizes `qwen_vl_utils`
CHW float frames to HWC uint8 so OV2's video processor's
`PIL.Image.fromarray` branch works.
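
The video-as-multi-image rewrite can be sketched in plain Python. The two helpers below, `expand_video_grid` and `rewrite_video_block`, are hypothetical names; the token strings are copied as written in this description (the real tokenizer's special tokens may be spelled differently), and the per-frame timestamps and pad counts are passed in rather than derived, since the PR text does not say how they are computed.

```python
from typing import List, Sequence

VIDEO_BLOCK = "<vision_start><video_pad><vision_end>"


def expand_video_grid(video_grid_thw: Sequence[Sequence[int]]) -> List[List[int]]:
    """Turn each [T, H, W] video row into T image rows of [1, H, W]."""
    rows: List[List[int]] = []
    for t, h, w in video_grid_thw:
        rows.extend([1, h, w] for _ in range(t))
    return rows


def rewrite_video_block(
    text: str,
    timestamps: Sequence[float],
    pads_per_frame: Sequence[int],
) -> str:
    """Replace the first video placeholder with one timed image block per frame."""
    frames = "".join(
        f"<{ts:.1f} seconds><vision_start>{'<image_pad>' * n}<vision_end>"
        for ts, n in zip(timestamps, pads_per_frame)
    )
    return text.replace(VIDEO_BLOCK, frames, 1)


print(expand_video_grid([[2, 4, 6]]))
print(rewrite_video_block("Q: " + VIDEO_BLOCK, [0.0, 0.5], [2, 1]))
```

The CHW float to HWC uint8 frame normalization mentioned above is not covered by this sketch.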
Misc
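- `models/utils.py`: routes llava_onevision2 flops through the
qwen2 estimator using `config.text_config` (its LM is Qwen3).
- `examples/llava_onevision2/{example.yaml,run.sh}`: training scaffold.
- `docs/models/llava_onevision2.md`: loading flow, liger / rmpad patch
composition + data processor; registered in
`docs/models/index.rst` (also picks up the existing-but-unlinked
`llava_onevision1_5.md`).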
Testing
liger + sequence packing kicks off training successfully against
`LLaVA-Video-178K` parquet shards; loss decreases over a short
smoke run with no NaNs.
Commit log
Checklist