Fix assistant_masks for multimodal inputs in apply_chat_template #44543
Open
umbilnm wants to merge 1 commit into huggingface:main
Conversation
What does this PR do?
Fixes #44521
`apply_chat_template` with `return_assistant_tokens_mask=True` returns all-zero masks when multimodal inputs (images/videos) are present.

Root cause
`generation_indices` (character-level positions of assistant responses) are computed from the original prompt text rendered by Jinja, which contains a single placeholder token per image (e.g. one `<|image_pad|>`). However, the processor's `__call__` expands each placeholder into N copies (based on image resolution), so the `offset_mapping` returned by the tokenizer corresponds to the expanded text. The `bisect_left` lookup then fails to find the assistant span, and the mask stays all zeros.

Fix
When multimodal inputs are present:

- the `offset_mapping` is kept aligned with `generation_indices` (so `bisect_left` works correctly)
- the mask is projected onto the expanded `input_ids` via two-pointer alignment: matching tokens get their mask value, extra expansion tokens get 0

This approach is generic and works for any multimodal processor that expands placeholder tokens, without requiring model-specific logic.
When no multimodal inputs are present, the original code path is used unchanged.
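The failure mode and the two-pointer projection can be sketched with a toy example. All values below are hypothetical (made-up token ids, offsets, and helper name), not the actual transformers implementation:

```python
# --- Root cause (toy example, made-up offsets) ------------------------
# Jinja renders one placeholder per image:
original = "<|image_pad|> Assistant: hi"
gen_start = original.index("hi")  # character index computed on this text

# The processor expands the placeholder into N copies (here N = 3),
# so the tokenizer's offset_mapping refers to this longer text:
expanded_text = "<|image_pad|>" * 3 + " Assistant: hi"
assert expanded_text.index("hi") != gen_start  # offsets have shifted

# --- Fix: two-pointer projection of the mask --------------------------
def project_mask(orig_ids, orig_mask, expanded_ids):
    """Map a mask computed on the original token sequence onto the
    expanded sequence: matching tokens keep their mask value, extra
    expansion tokens get 0."""
    out, i = [], 0
    for tok in expanded_ids:
        if i < len(orig_ids) and tok == orig_ids[i]:
            out.append(orig_mask[i])
            i += 1
        else:
            out.append(0)  # extra expansion copy of a placeholder token
    return out

# Token 7 stands in for the image placeholder, expanded into 3 copies:
orig_ids  = [1, 7, 2, 3]
orig_mask = [0, 0, 1, 1]   # assistant tokens are 2 and 3
expanded  = [1, 7, 7, 7, 2, 3]
print(project_mask(orig_ids, orig_mask, expanded))  # [0, 0, 0, 0, 1, 1]
```

The two-pointer walk only advances the original-sequence pointer on a match, so any run of inserted expansion tokens is skipped with mask 0 while the assistant span keeps its ones.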
Tests
Added `test_apply_chat_template_assistant_mask_with_image` in `test_processing_common.py`. Verified on Qwen2.5-VL and Qwen3-VL.

Before submitting
Who can review?
@zucchini-nlp (author of the original `return_assistant_tokens_mask` support in PR #38545, multimodal models)