Conversation

@erfanzar

Summary

Fixes #1169

This PR fixes two issues in the ragged-paged attention v3 kernels:

  • kernel_hd64.py (h64): Added the missing sliding window mask in the kernel. The
    original code only skipped fetching KV blocks outside the window, but didn't apply
    token-level masking within partially covered blocks (a short sketch follows the
    Changes table below).

  • kernel.py (h128): Added attention_sink support following the same pattern as the
    h64 kernel. Attention sinks allow the model to "dump" attention to a virtual token that
    doesn't contribute to the output.

Changes

| File | Change |
| --- | --- |
| kernel_hd64.py | Added sliding window mask in flash_attention |
| kernel.py | Added attention_sink parameter to all functions |
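
To illustrate the kernel_hd64.py fix, here is a minimal sketch of token-level sliding-window masking for one attention block, written in plain JAX rather than the actual Pallas kernel; the names `mask_block_scores`, `q_ids`, and `kv_ids` are illustrative, not identifiers from the kernel.

```python
import jax.numpy as jnp

def mask_block_scores(scores, q_ids, kv_ids, sliding_window=None, mask_value=-1e30):
    """Apply causal + sliding-window masking to one block of attention scores.

    scores: [num_q, num_kv] raw logits for the current block.
    q_ids / kv_ids: absolute token positions of the block's query / KV rows.
    """
    # Causal part: a query may only attend to KV positions at or before it.
    mask = kv_ids[None, :] <= q_ids[:, None]
    if sliding_window is not None:
        # Token-level window mask. Skipping whole KV blocks outside the window
        # is not enough: a block that straddles the window boundary is still
        # fetched, and its out-of-window tokens must be masked here.
        mask = jnp.logical_and(mask, kv_ids[None, :] > q_ids[:, None] - sliding_window)
    return jnp.where(mask, scores, mask_value)
```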

… to h128

Fixes vllm-project#1169

This PR fixes two issues in the ragged paged attention v3 kernels:

1. **kernel_hd64.py (h64)**: Added missing sliding window mask in the kernel.
   The original code only skipped fetching KV blocks outside the window but
   didn't apply token-level masking within partially-covered blocks.

2. **kernel.py (h128)**: Added attention_sink support following the same
   pattern as the h64 kernel. Attention sinks allow the model to "dump"
   attention to a virtual token that doesn't contribute to the output.
   Uses LEFT concatenation semantics where sink logits are prepended
   before softmax, then removed after (a sketch follows this list).
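
As a rough illustration of the LEFT concatenation semantics (not the kernel's actual code), here is a plain-JAX sketch assuming a single learned sink logit per head; `softmax_with_sink` and `sink_logit` are hypothetical names.

```python
import jax
import jax.numpy as jnp

def softmax_with_sink(scores, sink_logit):
    """Softmax over [sink | scores] for one head: the sink absorbs probability
    mass but, having no value vector, never contributes to the output."""
    num_q = scores.shape[0]
    sink_col = jnp.full((num_q, 1), sink_logit, dtype=scores.dtype)
    # Prepend the sink logit as a virtual KV column on the LEFT ...
    logits = jnp.concatenate([sink_col, scores], axis=-1)
    probs = jax.nn.softmax(logits, axis=-1)
    # ... then drop that column again before the probs @ V matmul.
    return probs[:, 1:]
```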

Changes:
- kernel_hd64.py: Added `if sliding_window is not None` mask in flash_attention
- kernel.py: Added attention_sink parameter to all functions (ref impl, kernel,
  prepare_inputs, validation, main function)
- kernel.py: Initialize m_prev with sink values and l_prev with 1.0 for
  proper online softmax tracking across blocks when using attention_sink
  (see the sketch below)
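
A sketch of how that initialization folds the sink into standard online softmax, in plain JAX with hypothetical names; the real kernel tracks this state per head inside the Pallas grid loop, which is assumed here rather than shown.

```python
import jax.numpy as jnp

def init_with_sink(sink_logit, num_q, head_dim):
    # Treat the sink as a virtual first block: the running max starts at the
    # sink logit, the running denominator at exp(sink - sink) = 1.0, and the
    # output accumulator stays zero because the sink carries no value vector.
    m_prev = jnp.full((num_q, 1), sink_logit)
    l_prev = jnp.ones((num_q, 1))
    acc = jnp.zeros((num_q, head_dim))
    return m_prev, l_prev, acc

def online_softmax_step(m_prev, l_prev, acc, scores, v):
    # Standard online-softmax update for one KV block
    # (scores: [num_q, num_kv], v: [num_kv, head_dim]).
    m_cur = jnp.maximum(m_prev, scores.max(axis=-1, keepdims=True))
    alpha = jnp.exp(m_prev - m_cur)
    p = jnp.exp(scores - m_cur)
    l_cur = alpha * l_prev + p.sum(axis=-1, keepdims=True)
    acc_cur = alpha * acc + p @ v
    return m_cur, l_cur, acc_cur
```

After the last block the output is acc / l, so the sink only inflates the denominator and never adds to the accumulated values.
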
@kyuyeunk
Collaborator

Hi @erfanzar, thanks for the quick PR. Here are a few comments:

  • The sliding-window mask issue in the hd64 variant is already being taken care of in this PR: [RPA][Kernel] Update hd64 variant sliding window code #1180
  • Adding the attention sink feature only to kernel_hd64.py and not to kernel.py was a deliberate choice to keep the kernel.py codebase streamlined, since the feature currently seems to be used only by gpt-oss, which has a head dim of 64. We will reevaluate this if a model with a head dim other than 64 also requires attention sink.

Development

Successfully merging this pull request may close these issues.

[Bug]: Sliding mask in v3 rpa might be wrong
