Thank you for your excellent work. I noticed that in an earlier version of the code, the `AttnProjection` module used causal attention, which was also explicitly described in the early version of the arXiv paper (https://arxiv.org/pdf/2502.20321v1):

> To ensure compatibility with autoregressive generation, the factorization blocks are configured with causal attention.

However, in the current codebase this appears to have been changed to standard bidirectional attention:
Line 41 in bb8012e:

```python
x = scaled_dot_product_attention(q, k, v)
```
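For context, the two variants differ only in whether future positions are masked out of the attention scores. Below is a minimal NumPy sketch of that difference (my own illustration, not code from this repository; in PyTorch the same switch is the `is_causal=True` flag of `torch.nn.functional.scaled_dot_product_attention`):

```python
import numpy as np

def attention(q, k, v, causal=False):
    """Single-head scaled dot-product attention over (seq_len, dim) arrays."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        # Mask future positions: token i may only attend to tokens j <= i.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Softmax over the key axis (numerically stabilized).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))

_, w_bidir = attention(q, k, v, causal=False)   # every row attends everywhere
_, w_causal = attention(q, k, v, causal=True)   # lower-triangular weights
```

With bidirectional attention each token's output depends on the whole sequence, which is incompatible with left-to-right autoregressive decoding unless the module is only used on fully available inputs (e.g. encoding a complete image); that trade-off is exactly what I am asking about.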
May I ask what motivated this change, and whether there were specific considerations behind it?
More directly: what differences do these two choices make in practice, specifically in terms of reconstruction quality, semantic accuracy, understanding capability, and autoregressive generation performance?