Fix assistant_masks for multimodal inputs in apply_chat_template #44543
Open
umbilnm wants to merge 1 commit into huggingface:main
Conversation
What does this PR do?
Fixes #44521
`apply_chat_template` with `return_assistant_tokens_mask=True` returns all-zero masks when multimodal inputs (images/videos) are present.

Root cause
`generation_indices` (character-level positions of assistant responses) are computed from the original prompt text rendered by Jinja, which contains a single placeholder token per image (e.g. one `<|image_pad|>`). However, the processor's `__call__` expands each placeholder into N copies (based on image resolution), so the `offset_mapping` returned by the tokenizer corresponds to the expanded text. The `bisect_left` lookup then fails to find the assistant span, and the mask stays all zeros.

Fix
When multimodal inputs are present:

- the `offset_mapping` is kept aligned with `generation_indices` (so `bisect_left` works correctly)
- the mask is projected onto the expanded `input_ids` via two-pointer alignment: matching tokens get their mask value, extra expansion tokens get 0

This approach is generic and works for any multimodal processor that expands placeholder tokens, without requiring model-specific logic.
When no multimodal inputs are present, the original code path is used unchanged.
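The failure mode and the two-pointer projection can be sketched with a toy example. All values below are hypothetical (made-up token ids, offsets, and helper name), not the actual transformers implementation:

```python
# --- Root cause (toy example, made-up offsets) ------------------------
# Jinja renders one placeholder per image:
original = "<|image_pad|> Assistant: hi"
gen_start = original.index("hi")  # character index computed on this text

# The processor expands the placeholder into N copies (here N = 3),
# so the tokenizer's offset_mapping refers to this longer text:
expanded_text = "<|image_pad|>" * 3 + " Assistant: hi"
assert expanded_text.index("hi") != gen_start  # offsets have shifted

# --- Fix: two-pointer projection of the mask --------------------------
def project_mask(orig_ids, orig_mask, expanded_ids):
    """Map a mask computed on the original token sequence onto the
    expanded sequence: matching tokens keep their mask value, extra
    expansion tokens get 0."""
    out, i = [], 0
    for tok in expanded_ids:
        if i < len(orig_ids) and tok == orig_ids[i]:
            out.append(orig_mask[i])
            i += 1
        else:
            out.append(0)  # extra expansion copy of a placeholder token
    return out

# Token 7 stands in for the image placeholder, expanded into 3 copies:
orig_ids  = [1, 7, 2, 3]
orig_mask = [0, 0, 1, 1]   # assistant tokens are 2 and 3
expanded  = [1, 7, 7, 7, 2, 3]
print(project_mask(orig_ids, orig_mask, expanded))  # [0, 0, 0, 0, 1, 1]
```

The two-pointer walk only advances the original-sequence pointer on a match, so any run of inserted expansion tokens is skipped with mask 0 while the assistant span keeps its ones.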
Tests
Added `test_apply_chat_template_assistant_mask_with_image` in `test_processing_common.py`. Verified on Qwen2.5-VL and Qwen3-VL.

Before submitting
Who can review?
@zucchini-nlp (author of the original `return_assistant_tokens_mask` support in PR #38545, multimodal models)