fix(rpa-v3): add sliding window mask to h64 kernel and attention_sink to h128 #1185
Summary
Fixes #1169
This PR fixes two issues in the ragged-paged attention v3 kernels:
`kernel_hd64.py` (h64): Added the missing sliding window mask inside the kernel. The
original code only skipped fetching KV blocks that fall entirely outside the window, but did not apply
token-level masking within partially covered blocks.
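For illustration, a minimal JAX sketch of the kind of token-level masking this fix adds. The function and parameter names (`apply_sliding_window_mask`, `q_positions`, `kv_positions`, `sliding_window`) are placeholders for this sketch, not the kernel's actual identifiers:

```python
import jax.numpy as jnp

def apply_sliding_window_mask(scores, q_positions, kv_positions, sliding_window):
    """Mask attention logits so each query attends only to keys in
    (q_pos - sliding_window, q_pos], i.e. the causal sliding window.

    scores:       [num_q_tokens, num_kv_tokens] attention logits
    q_positions:  [num_q_tokens] absolute positions of the query tokens
    kv_positions: [num_kv_tokens] absolute positions of the KV tokens
    """
    # Causal constraint: a query may not attend to keys after it.
    causal = kv_positions[None, :] <= q_positions[:, None]
    # Window constraint: keys more than `sliding_window` tokens behind the
    # query are masked out, even when their block was fetched because part
    # of the block overlaps the window.
    in_window = kv_positions[None, :] > q_positions[:, None] - sliding_window
    mask = causal & in_window
    return jnp.where(mask, scores, jnp.finfo(scores.dtype).min)
```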
`kernel.py` (h128): Added `attention_sink` support, following the same pattern as the
h64 kernel. An attention sink lets the model "dump" attention onto a virtual token that
does not contribute to the output.
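As a rough, non-paged illustration of the sink mechanism (the real kernels use an online softmax over paged KV blocks; the names and shapes below are assumptions for the sketch):

```python
import jax
import jax.numpy as jnp

def attention_with_sink(q, k, v, sink_logits):
    """q: [num_heads, q_len, head_dim]; k, v: [num_heads, kv_len, head_dim]
    sink_logits: [num_heads] per-head logit for the virtual sink token.
    """
    scale = q.shape[-1] ** -0.5
    scores = jnp.einsum("hqd,hkd->hqk", q, k) * scale            # [h, q, kv]
    # Append the sink logit as an extra "token" column in the score matrix.
    sink = jnp.broadcast_to(sink_logits[:, None, None], (*scores.shape[:2], 1))
    logits = jnp.concatenate([scores, sink], axis=-1)            # [h, q, kv+1]
    probs = jax.nn.softmax(logits, axis=-1)
    # Drop the sink column: it absorbs probability mass in the softmax
    # denominator but has no value vector, so it adds nothing to the output.
    return jnp.einsum("hqk,hkd->hqd", probs[..., :-1], v)
```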
Changes