feat: add Gemma4 VL support #141
Open
anxiangsir wants to merge 6 commits into
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Pull request overview
Adds end-to-end Gemma4-VL model support to LLaVA-OneVision-2, including the Gemma4 hybrid-attention LLM, vision tower, adapter, multimodal data plumbing, HF↔mcore checkpoint converters, quick-start training scripts, and accompanying smoke/consistency tests.
Changes:
- New `gemma4_vl` model family (LLM with hybrid sliding/full attention, K=V tying, per-layer scalars; Gemma4 ViT + adapter) and registration through provider/arguments/data plugins/chat template.
- New `tools/convert_checkpoint/custom/gemma4_vl/` converters (LLM, ViT body, vision patch, adapter, and merger) plus two-stage quick-start shell scripts.
- New regression tests in `tests/_shared/` and a `tests/consistency_gemma4/` collection-smoke fixture.
Reviewed changes
Copilot reviewed 51 out of 53 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| aiak_training_llm/models/gemma4_vl/* | Gemma4 LLM/ViT/adapter/provider definitions |
| aiak_training_llm/data/chat_templete.py | Registers gemma4 chat template (turn delimiter strings look malformed) |
| aiak_training_llm/data/mm_plugin.py | Adds Gemma4VLPlugin; introduces a redundant torch import |
| aiak_training_llm/train/arguments.py | Adds Gemma4 architecture-specific CLI args; list/dict defaults lack type= |
| aiak_training_llm/train/pretrain/pretrain_gemma4_vl.py | Gemma4 pretraining entry; contains dead is_video / attn_mask_type assignments |
| tools/convert_checkpoint/custom/gemma4_vl/{llm,vit,vision_patch,adapter,merge_megatron}.py | HF↔mcore conversion + merger; carry stale Baidu copyright headers |
| examples/gemma4_vl/quick_start_26b_a4b/stage{1,2}_*.sh | Quick-start scripts with shebang placed mid-file rather than line 1 |
| tests/consistency_gemma4/conftest.py | Consistency fixture; passes --chat-template qwen2-vl for a Gemma4 model |
| tests/_shared/test_gemma4_*.py | New smoke tests for window/RoPE/attention mask/vision state dict/skeleton |
Comment on lines +467 to +470

    slots=["<|turn>user\n{{content}}<turn|>\n<|turn>model\n"]
    ),
    format_assistant=StringFormatter(slots=["{{content}}<turn|>\n"]),
    format_system=StringFormatter(slots=["<|turn>system\n{{content}}<turn|>\n"]),

    TOKENIZER_PATH=${TOKENIZER_PATH:-"/ov2/pretrain_models/google/gemma-4-26B-A4B-it"}
    CHECKPOINT_PATH=${CHECKPOINT_PATH:-"/workspace/LLaVA-OneVision-2/stage_0_gemma4_26b_a4b_release"}

    #! /bin/bash

    TOKENIZER_PATH=${TOKENIZER_PATH:-"/ov2/pretrain_models/google/gemma-4-26B-A4B-it"}
    CHECKPOINT_PATH=${CHECKPOINT_PATH:-"/workspace/LLaVA-OneVision-2/stage_0_gemma4_26b_a4b_release"}

    #! /bin/bash
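
The kernel honors a shebang only when `#!` is the first two bytes of the file, so a `#! /bin/bash` line that appears after variable assignments is treated as an ordinary comment. A minimal sketch of a check for this, assuming a hypothetical helper name `shebang_line_number` (it is not part of the PR):

```python
from typing import Optional


def shebang_line_number(script_text: str) -> Optional[int]:
    """Return the 1-based line number of the first '#!' line, or None if absent."""
    for number, line in enumerate(script_text.splitlines(), start=1):
        if line.startswith("#!"):
            return number
    return None


def has_effective_shebang(script_text: str) -> bool:
    """A shebang only takes effect when the file itself starts with '#!'."""
    return script_text.startswith("#!")
```

A script whose `shebang_line_number` is anything other than 1 still runs under `bash script.sh`, but executing it directly falls back to whatever the calling shell chooses.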
Comment on lines +104 to +105

    "--chat-template",
    "qwen2-vl",

    attn_mask = build_gemma4_mm_attention_mask(
    tokens, mm_token_type_ids, sliding_window=sliding_window
    )
    attn_mask_type = AttnMaskType.causal
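
For readers unfamiliar with the hybrid sliding/full attention mentioned in the overview, the following is a minimal pure-Python sketch of a sliding-window causal mask. It is not the PR's `build_gemma4_mm_attention_mask`: the handling of `mm_token_type_ids` is not shown in this conversation and is omitted, and the convention that the window counts the current token is an assumption.

```python
from typing import List


def sliding_window_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed convention: causal (j <= i) and the window includes the current
    token, so position i sees keys i - window + 1 through i.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def full_causal_mask(seq_len: int) -> List[List[bool]]:
    """Plain causal mask: a window at least as long as the sequence."""
    return sliding_window_causal_mask(seq_len, max(seq_len, 1))
```

With `seq_len=4` and `window=2`, the last row is `[False, False, True, True]`: the final token sees only itself and its immediate predecessor, whereas a full-attention layer would see all four positions.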

    video_grid_thw = tensor_parallel.broadcast_data(["video_grid_thw"], data, torch.int32)["video_grid_thw"]

    packed_seq_params = None
    is_video = video_token_id in tokens
Comment on lines 17 to 18

    import torch
Comment on lines +143 to +152

    group.add_argument('--layer-pattern', default=[],
    help='Gemma4 per-layer attention pattern. Usually set by --model-name.')
    group.add_argument('--per-layer-kv-channels', default={},
    help='Gemma4 per-layer-type head dim overrides. Usually set by --model-name.')
    group.add_argument('--per-layer-num-query-groups', default={},
    help='Gemma4 per-layer-type KV head overrides. Usually set by --model-name.')
    group.add_argument('--attention-k-eq-v', action='store_true', default=False,
    help='Enable Gemma4 K=V tying. Usually set by --model-name.')
    group.add_argument('--kv-tied-layers', default=[],
    help='Gemma4 K=V tied layer indices. Usually set by --model-name.')
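
Without `type=` or `nargs`, `argparse` stores any value passed on the command line as a single raw string, so `--kv-tied-layers 1` would yield `"1"` rather than a list. One possible way to address the overview's note about list/dict defaults is sketched below. The flag names come from the diff above; the use of `nargs="*"` and `json.loads`, the helper name `parse_gemma4_args`, and the example values are assumptions, not the PR's actual fix.

```python
import argparse
import json
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    group = parser.add_argument_group("gemma4")
    # nargs="*" collects space-separated values into a list of the given type.
    group.add_argument("--layer-pattern", nargs="*", type=str, default=None)
    group.add_argument("--kv-tied-layers", nargs="*", type=int, default=None)
    # json.loads parses a JSON object string into a dict.
    group.add_argument("--per-layer-kv-channels", type=json.loads, default=None)
    group.add_argument("--per-layer-num-query-groups", type=json.loads, default=None)
    group.add_argument("--attention-k-eq-v", action="store_true")
    return parser


def parse_gemma4_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # Swap None for fresh containers so no mutable default object is shared.
    for name in ("layer_pattern", "kv_tied_layers"):
        if getattr(args, name) is None:
            setattr(args, name, [])
    for name in ("per_layer_kv_channels", "per_layer_num_query_groups"):
        if getattr(args, name) is None:
            setattr(args, name, {})
    return args
```

For example, `parse_gemma4_args(["--kv-tied-layers", "1", "3"])` yields `kv_tied_layers == [1, 3]`, and an empty argument list yields empty containers for every list and dict option.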
Comment on lines +209 to +212

    # K=V: value shares the K tensor. No clone — downstream kernels (TE/flash-attn)
    # treat K and V as read-only. Saves one tensor allocation per layer.
    value = key
Comment on lines +3 to +7

    ################################################################################
    #
    # Copyright (c) 2024 Baidu.com, Inc. All Rights Reserved
    #
    ################################################################################
Summary
Validation
- `pytest tests/_shared/test_gemma4_per_layer_window_size.py tests/_shared/test_gemma4_rope_inv_freq.py tests/_shared/test_gemma4_vision_state_dict.py tests/_shared/test_gemma4_vl_attention_mask.py tests/_shared/test_gemma4_vl_skeleton.py tests/consistency_gemma4/test_model_consistency.py::test_collection_smoke -v` → 25 passed.
- `/ov2/pretrain_models/google/gemma-4-26B-A4B-it` into `tmp_test_gemma4_mcore_ckpt_conversion_fresh`.
- `tmp_test_gemma4_roundtrip_hf_fresh`; original vs roundtrip HF had 1013/1013 keys, 0 missing/extra/shape mismatches, and full tensor equality with `max_diff=0.0`.
- `tmp_test_gemma4_mcore_ckpt_tp1_pp1_ep1`; rerun against the fresh checkpoint was blocked by current A800 GPU memory pressure/OOM, not by checkpoint key/layout errors.
Notes