Skip to content

fix mask return-type contract regression and add correctness guard for#47019

Open
kaixuanliu wants to merge 8 commits into
huggingface:mainfrom
kaixuanliu:mask_fix_xpu
Open

fix mask return-type contract regression and add correctness guard for#47019
kaixuanliu wants to merge 8 commits into
huggingface:mainfrom
kaixuanliu:mask_fix_xpu

Conversation

@kaixuanliu

@kaixuanliu kaixuanliu commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

CI

This PR fixes issue: #46962

  1. fix return-type regression for create_masks_for_generate func
  2. Add strict restriction for _can_skip_causal_mask_xpu: only when kv_offset==0 can we use fast path
    @ArthurZucker @Cyrilvallez pls help review. Thx!

xpu

Signed-off-by: kaixuanliu <kaixuan.liu@intel.com>
Signed-off-by: kaixuanliu <kaixuan.liu@intel.com>
Signed-off-by: kaixuanliu <kaixuan.liu@intel.com>

@vasqu vasqu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some smaller initial comments

Comment thread src/transformers/masking_utils.py Outdated
Comment on lines +363 to +372
if isinstance(attention_mask, dict):
causal_mask = attention_mask[self.config.layer_types[0]]
else:
causal_mask = create_causal_mask(
config=self.config,
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
past_key_values=past_key_values,
position_ids=position_ids,
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather we follow the pattern of

# It may already have been prepared by e.g. `generate`
if not isinstance(causal_mask_mapping := attention_mask, dict):
# Prepare mask arguments
mask_kwargs = {
"config": self.config,
"inputs_embeds": inputs_embeds,
"attention_mask": attention_mask,
"past_key_values": past_key_values,
"position_ids": position_ids,
}
# Create the masks
causal_mask_mapping = {
"full_attention": create_causal_mask(**mask_kwargs),
"sliding_attention": create_sliding_window_causal_mask(**mask_kwargs),
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well this looks better. Have updated the code.

kaixuanliu and others added 3 commits July 2, 2026 22:43
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Signed-off-by: kaixuanliu <kaixuan.liu@intel.com>
Signed-off-by: kaixuanliu <kaixuan.liu@intel.com>
"past_key_values": past_key_values,
"position_ids": position_ids,
}
causal_mask_mapping = {self.config.layer_types[0]: create_causal_mask(**mask_kwargs)}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last nit can we directly use the layer type that is intended here? E.g. "deepseek_sparse_attention"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@vasqu

vasqu commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Also cc @Cyrilvallez to double check in the end please 🫡

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: deepseek_v32, glm_moe_dsa

Signed-off-by: kaixuanliu <kaixuan.liu@intel.com>
@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

CI recap

Dashboard: View test results in Grafana
Latest run: 28632012077:2
Result: success | Jobs: 15 | Tests: 65,281 | Failures: 0 | Duration: 18h 36m

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants