feat: add Gemma4 VL support #141

Open
anxiangsir wants to merge 6 commits into llava_onevision2 from feat/gemma4-vl

Conversation

@anxiangsir
Collaborator

Summary

  • Add Gemma4-VL model family registration plus language, vision, adapter, and multimodal data pipeline support.
  • Add HF↔mcore Gemma4 checkpoint conversion tools and 26B-A4B quick-start scripts.
  • Add Gemma4 regression and consistency smoke tests covering attention masks, RoPE, vision state dicts, model skeleton, and fixture loading.

Validation

  • pytest tests/_shared/test_gemma4_per_layer_window_size.py tests/_shared/test_gemma4_rope_inv_freq.py tests/_shared/test_gemma4_vision_state_dict.py tests/_shared/test_gemma4_vl_attention_mask.py tests/_shared/test_gemma4_vl_skeleton.py tests/consistency_gemma4/test_model_consistency.py::test_collection_smoke -v → 25 passed.
  • HF→mcore fresh conversion completed for /ov2/pretrain_models/google/gemma-4-26B-A4B-it into tmp_test_gemma4_mcore_ckpt_conversion_fresh.
  • mcore→HF roundtrip completed into tmp_test_gemma4_roundtrip_hf_fresh; original vs roundtrip HF had 1013/1013 keys, 0 missing/extra/shape mismatches, and full tensor equality with max_diff=0.0.
  • The full consistency test previously passed against tmp_test_gemma4_mcore_ckpt_tp1_pp1_ep1; a rerun against the fresh checkpoint was blocked by GPU memory pressure (OOM) on the A800, not by checkpoint key or layout errors.
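
The roundtrip check described above (key counts, missing/extra keys, shape mismatches, max_diff) can be sketched roughly as follows. This is a minimal illustration, not the script used in this PR: `compare_state_dicts` is a hypothetical helper, and tensors are represented as `(shape, flattened values)` tuples so the example runs without torch. A real check would presumably load the HF shards and compare torch tensors instead.

```python
from typing import Dict, Sequence, Tuple

# Assumption: a stand-in "tensor" is (shape, flattened values).
# A real comparison would operate on torch tensors loaded from the checkpoints.
Tensor = Tuple[Tuple[int, ...], Sequence[float]]


def compare_state_dicts(original: Dict[str, Tensor], roundtrip: Dict[str, Tensor]) -> dict:
    """Report missing/extra keys, shape mismatches, and the largest absolute difference."""
    orig_keys, rt_keys = set(original), set(roundtrip)
    report = {
        "missing": sorted(orig_keys - rt_keys),  # in original, absent after roundtrip
        "extra": sorted(rt_keys - orig_keys),    # only present after roundtrip
        "shape_mismatch": [],
        "max_diff": 0.0,
    }
    for name in sorted(orig_keys & rt_keys):
        (shape_a, vals_a), (shape_b, vals_b) = original[name], roundtrip[name]
        if shape_a != shape_b:
            report["shape_mismatch"].append(name)
            continue  # values are not comparable when shapes differ
        for a, b in zip(vals_a, vals_b):
            report["max_diff"] = max(report["max_diff"], abs(a - b))
    return report


# Hypothetical key names and values, for illustration only.
hf_original = {"embed.weight": ((2, 2), [0.1, 0.2, 0.3, 0.4]), "norm.weight": ((2,), [1.0, 1.0])}
hf_roundtrip = {"embed.weight": ((2, 2), [0.1, 0.2, 0.3, 0.4]), "norm.weight": ((2,), [1.0, 1.0])}
report = compare_state_dicts(hf_original, hf_roundtrip)
print(report)
```

A lossless roundtrip corresponds to empty `missing`, `extra`, and `shape_mismatch` lists and a `max_diff` of exactly 0.0.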

Notes

  • This PR intentionally excludes unrelated dirty worktree files such as offline packing changes, generated checkpoints, logs, and temporary scripts.
  • The new shared config/argument fields default to no-op values and are only activated by the Gemma4 model config.

anxiangsir and others added 6 commits May 17, 2026 23:34
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Copilot AI review requested due to automatic review settings May 17, 2026 15:37
Copilot AI left a comment

Pull request overview

Adds end-to-end Gemma4-VL model support to LLaVA-OneVision-2, including the Gemma4 hybrid-attention LLM, vision tower, adapter, multimodal data plumbing, HF↔mcore checkpoint converters, quick-start training scripts, and accompanying smoke/consistency tests.

Changes:

  • New gemma4_vl model family (LLM with hybrid sliding/full attention, K=V tying, per-layer scalars; Gemma4 ViT + adapter) and registration through provider/arguments/data plugins/chat template.
  • New tools/convert_checkpoint/custom/gemma4_vl/ converters (LLM, ViT body, vision patch, adapter, and merger) plus two-stage quick-start shell scripts.
  • New regression tests in tests/_shared/ and a tests/consistency_gemma4/ collection-smoke fixture.
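
To illustrate what "hybrid sliding/full attention" means at the mask level, here is a minimal pure-Python sketch. It is not the PR's `build_gemma4_mm_attention_mask`: it ignores multimodal token types entirely, the layer-type strings `"sliding"` and `"full"` are hypothetical, and the window convention (a query sees itself plus the previous `window - 1` positions) is an assumption that may differ from the actual implementation.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: never look at future positions
            if window is not None:
                # Assumed convention: only the most recent `window` positions are visible.
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


def masks_for_layers(layer_pattern: List[str], seq_len: int, window: int) -> List[List[List[bool]]]:
    """Build one mask per layer: windowed for 'sliding' layers, plain causal otherwise."""
    return [
        causal_mask(seq_len, window if kind == "sliding" else None)
        for kind in layer_pattern
    ]


layer_masks = masks_for_layers(["sliding", "full"], seq_len=4, window=2)
print(layer_masks[0][3])  # sliding layer, last query
print(layer_masks[1][3])  # full layer, last query
```

With a window of 2, the last query in the sliding layer sees only positions 2 and 3, while the same query in the full layer sees all four positions.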

Reviewed changes

Copilot reviewed 51 out of 53 changed files in this pull request and generated 10 comments.

Show a summary per file
| File | Description |
| --- | --- |
| `aiak_training_llm/models/gemma4_vl/*` | Gemma4 LLM/ViT/adapter/provider definitions |
| `aiak_training_llm/data/chat_templete.py` | Registers gemma4 chat template (turn delimiter strings look malformed) |
| `aiak_training_llm/data/mm_plugin.py` | Adds Gemma4VLPlugin; introduces a redundant torch import |
| `aiak_training_llm/train/arguments.py` | Adds Gemma4 architecture-specific CLI args; list/dict defaults lack type= |
| `aiak_training_llm/train/pretrain/pretrain_gemma4_vl.py` | Gemma4 pretraining entry; contains dead is_video / attn_mask_type assignments |
| `tools/convert_checkpoint/custom/gemma4_vl/{llm,vit,vision_patch,adapter,merge_megatron}.py` | HF↔mcore conversion + merger; carry stale Baidu copyright headers |
| `examples/gemma4_vl/quick_start_26b_a4b/stage{1,2}_*.sh` | Quick-start scripts with shebang placed mid-file rather than line 1 |
| `tests/consistency_gemma4/conftest.py` | Consistency fixture; passes --chat-template qwen2-vl for a Gemma4 model |
| `tests/_shared/test_gemma4_*.py` | New smoke tests for window/RoPE/attention mask/vision state dict/skeleton |

Comment on lines +467 to +470
slots=["<|turn>user\n{{content}}<turn|>\n<|turn>model\n"]
),
format_assistant=StringFormatter(slots=["{{content}}<turn|>\n"]),
format_system=StringFormatter(slots=["<|turn>system\n{{content}}<turn|>\n"]),
TOKENIZER_PATH=${TOKENIZER_PATH:-"/ov2/pretrain_models/google/gemma-4-26B-A4B-it"}
CHECKPOINT_PATH=${CHECKPOINT_PATH:-"/workspace/LLaVA-OneVision-2/stage_0_gemma4_26b_a4b_release"}

#! /bin/bash
TOKENIZER_PATH=${TOKENIZER_PATH:-"/ov2/pretrain_models/google/gemma-4-26B-A4B-it"}
CHECKPOINT_PATH=${CHECKPOINT_PATH:-"/workspace/LLaVA-OneVision-2/stage_0_gemma4_26b_a4b_release"}

#! /bin/bash
Comment on lines +104 to +105
"--chat-template",
"qwen2-vl",
attn_mask = build_gemma4_mm_attention_mask(
    tokens, mm_token_type_ids, sliding_window=sliding_window
)
attn_mask_type = AttnMaskType.causal
video_grid_thw = tensor_parallel.broadcast_data(["video_grid_thw"], data, torch.int32)["video_grid_thw"]

packed_seq_params = None
is_video = video_token_id in tokens
Comment on lines 17 to 18
import torch

Comment on lines +143 to +152
group.add_argument('--layer-pattern', default=[],
                   help='Gemma4 per-layer attention pattern. Usually set by --model-name.')
group.add_argument('--per-layer-kv-channels', default={},
                   help='Gemma4 per-layer-type head dim overrides. Usually set by --model-name.')
group.add_argument('--per-layer-num-query-groups', default={},
                   help='Gemma4 per-layer-type KV head overrides. Usually set by --model-name.')
group.add_argument('--attention-k-eq-v', action='store_true', default=False,
                   help='Enable Gemma4 K=V tying. Usually set by --model-name.')
group.add_argument('--kv-tied-layers', default=[],
                   help='Gemma4 K=V tied layer indices. Usually set by --model-name.')
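
One possible way to address the "list/dict defaults lack type=" remark from the file summary is sketched below. Without `type=`, argparse hands any value given on the command line to the program as a plain string, so a list or dict flag would only have the intended type when left at its default. The JSON encoding, the helper names `json_list` and `json_dict`, and the example values are assumptions for illustration; the PR may well choose a different format.

```python
import argparse
import json


def json_list(text: str) -> list:
    """Parse a CLI value as a JSON list (assumed encoding, not the PR's)."""
    value = json.loads(text)
    if not isinstance(value, list):
        raise argparse.ArgumentTypeError(f"expected a JSON list, got {text!r}")
    return value


def json_dict(text: str) -> dict:
    """Parse a CLI value as a JSON object (assumed encoding, not the PR's)."""
    value = json.loads(text)
    if not isinstance(value, dict):
        raise argparse.ArgumentTypeError(f"expected a JSON object, got {text!r}")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    group = parser.add_argument_group("gemma4")
    # default=None instead of []/{} avoids a mutable default shared across parses.
    group.add_argument("--layer-pattern", type=json_list, default=None,
                       help="Gemma4 per-layer attention pattern.")
    group.add_argument("--per-layer-kv-channels", type=json_dict, default=None,
                       help="Gemma4 per-layer-type head dim overrides.")
    group.add_argument("--kv-tied-layers", type=json_list, default=None,
                       help="Gemma4 K=V tied layer indices.")
    return parser


# Hypothetical values; the real pattern is usually set by --model-name.
args = build_parser().parse_args(
    ["--layer-pattern", '["sliding", "full"]', "--per-layer-kv-channels", '{"full": 512}']
)
print(args.layer_pattern, args.per_layer_kv_channels, args.kv_tied_layers)
```

Code that consumes these arguments would then treat `None` as "not overridden" and fall back to the model config.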
Comment on lines +209 to +212

# K=V: value shares the K tensor. No clone — downstream kernels (TE/flash-attn)
# treat K and V as read-only. Saves one tensor allocation per layer.
value = key
Comment on lines +3 to +7
################################################################################
#
# Copyright (c) 2024 Baidu.com, Inc. All Rights Reserved
#
################################################################################