
Submission/hybrid rwkv token shift #1007

Open
dillon-blake wants to merge 4 commits into openai:main from dillon-blake:submission/hybrid-rwkv-token-shift

Conversation

@dillon-blake

Non-record submission exploring hybrid transformer architectures that replace most attention layers with a lightweight RWKV-inspired token-shift mixing mechanism. The core idea is that most transformer layers only need local context, so full quadratic attention is wasteful for them. Instead, 8 of the 11 layers use a simple token-shift operation that blends each token with its predecessor via learned per-dimension interpolation weights; the remaining 3 layers keep quadratic attention, with short 128-token windows except for the final attention layer, which retains full context.
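For readers unfamiliar with the mechanism, the token-shift mixing described above can be sketched as a per-dimension interpolation between each token and its predecessor. This is a minimal illustrative version (names, shapes, and the zero-padding of the first position are assumptions, not the submission's exact code):

```python
import numpy as np

def token_shift(x, mu):
    """RWKV-style token shift.

    Blends each token with the previous token using learned
    per-dimension interpolation weights.

    x:  (seq_len, d_model) layer activations
    mu: (d_model,) mix weights in [0, 1]; mu = 1 keeps the current
        token unchanged, mu = 0 copies the previous token
    """
    # x_prev[t] = x[t-1]; the first position has no predecessor,
    # so it mixes with zeros here (one common convention)
    x_prev = np.zeros_like(x)
    x_prev[1:] = x[:-1]
    return mu * x + (1.0 - mu) * x_prev

# Toy usage on a 4-token, 2-dimensional sequence
x = np.arange(8, dtype=np.float64).reshape(4, 2)
mu = np.array([0.75, 0.25])
y = token_shift(x, mu)
```

Unlike attention, this is O(seq_len · d_model) with no pairwise token interactions, which is why stacking it in most layers is so cheap; the few remaining attention layers supply the longer-range mixing.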

The architecture achieves a 3-seed mean val_bpb of 1.2252 with 17.0M parameters, quantized to int6 and zlib-compressed to ~15.86 MB. While this does not beat the current SOTA, I believe the token-shift approach is promising for its efficiency, particularly at inference, where the reduced attention overhead could significantly speed up decoding.

Beyond the hybrid architecture, the submission stacks several techniques from the leaderboard: SmearGate, bigram hash embeddings, value embeddings, XSA (cross-head suppression), partial RoPE (16/64 dims), LeakyReLU squared activation, Muon optimizer, EMA with late QAT, and logit softcapping. Full details and ablation notes are in the README.

dillon-blake and others added 4 commits March 27, 2026 17:58
… 1.2252)

11-layer hybrid transformer with 8 RWKV-style token-shift layers and 3
short-window attention layers. 17.0M params, int6 quantized + zlib
compressed to ~15.86 MB. 3-seed mean val_bpb: 1.2252.
Updated the README to clarify the performance observations and methodologies related to hybrid transformer architectures and attention mechanisms.