
Fix assistant_masks for multimodal inputs in apply_chat_template#44543

Open
umbilnm wants to merge 1 commit into huggingface:main from umbilnm:fix/multimodal-assistant-masks

Conversation


@umbilnm umbilnm commented Mar 9, 2026

What does this PR do?

Fixes #44521

apply_chat_template with return_assistant_tokens_mask=True returns all-zero masks when multimodal inputs (images/videos) are present.

Root cause

generation_indices (character-level positions of assistant responses) are computed from the original prompt text rendered by Jinja, which contains a single placeholder token per image (e.g. one <|image_pad|>). However, the processor's __call__ expands each placeholder into N copies (based on image resolution), so offset_mapping returned by the tokenizer corresponds to the expanded text. The bisect_left lookup then fails to find the assistant span, and the mask stays all zeros.
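The mismatch can be shown with a toy example (the placeholder string and prompt text here are hypothetical, not any model's actual chat template):

```python
# Toy prompt as Jinja renders it: ONE placeholder token per image.
prompt = "<|image_pad|>User: hi\nAssistant: hello"
# Character span of the assistant response, computed on `prompt`.
gen_start, gen_end = prompt.index("hello"), len(prompt)

# The processor expands the placeholder N times (N depends on image
# resolution; here N=4), so the tokenizer actually sees this longer
# text, and offset_mapping refers to it instead:
expanded = "<|image_pad|>" * 4 + prompt[len("<|image_pad|>"):]

# Every character offset after the placeholder shifts by the expansion,
# so a bisect_left lookup of (gen_start, gen_end) against offsets from
# the expanded text can no longer locate the assistant span.
shift = len(expanded) - len(prompt)
print(shift, expanded.index("hello") - prompt.index("hello"))  # both 39
```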

Fix

When multimodal inputs are present:

  1. Tokenize the original (unexpanded) prompt separately to get offset_mapping aligned with generation_indices
  2. Build the assistant mask on the original tokenization (where bisect_left works correctly)
  3. Map the mask onto the expanded input_ids via two-pointer alignment — matching tokens get their mask value, extra expansion tokens get 0

This approach is generic and works for any multimodal processor that expands placeholder tokens, without requiring model-specific logic.
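Step 3's two-pointer alignment can be sketched as follows (function name and token IDs are illustrative, not the actual implementation):

```python
def map_mask_to_expanded(orig_ids, orig_mask, expanded_ids):
    """Carry mask values from the original tokenization onto the
    expanded input_ids; extra expansion tokens get 0."""
    expanded_mask = []
    i = 0  # pointer into the original tokenization
    for tok in expanded_ids:
        if i < len(orig_ids) and tok == orig_ids[i]:
            # Token present in both tokenizations: copy its mask value.
            expanded_mask.append(orig_mask[i])
            i += 1
        else:
            # Token inserted by placeholder expansion: never assistant.
            expanded_mask.append(0)
    return expanded_mask

orig_ids  = [101, 5, 7, 9]        # 5 = single image placeholder
orig_mask = [0,   0, 1, 1]        # assistant tokens are 7 and 9
expanded  = [101, 5, 5, 5, 7, 9]  # processor repeated the placeholder
print(map_mask_to_expanded(orig_ids, orig_mask, expanded))
# [0, 0, 0, 0, 1, 1]
```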

When no multimodal inputs are present, the original code path is used unchanged.

Tests

Added test_apply_chat_template_assistant_mask_with_image in test_processing_common.py. Verified on Qwen2.5-VL and Qwen3-VL:

  • Without fix: FAIL (mask is all zeros)
  • With fix: PASS (mask correctly marks assistant tokens)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp (author of the original return_assistant_tokens_mask support in PR #38545, multimodal models)


