Submission/hybrid rwkv token shift#1007
Open
dillon-blake wants to merge 4 commits into openai:main from
Conversation
11-layer hybrid transformer with 8 RWKV-style token-shift layers and 3 short-window attention layers. 17.0M params, int6 quantized + zlib compressed to ~15.86 MB. 3-seed mean val_bpb: 1.2252.
Updated the README to clarify the performance observations and methodologies related to hybrid transformer architectures and attention mechanisms.
Non-record submission exploring hybrid transformer architectures that replace most attention layers with a lightweight RWKV-inspired token-shift mixing mechanism. The core idea is that most layers in a transformer only need local context, so full quadratic attention is wasteful for them. Instead, 8 of 11 layers use a simple token-shift operation that blends adjacent tokens via learned per-dimension interpolation weights, while 3 layers retain quadratic attention with short (128-token) windows; the final attention layer keeps full context.
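The token-shift mixing described above can be sketched roughly as follows. This is a minimal illustration in numpy, not the submission's actual code: `mu` stands in for the learned per-dimension interpolation weights, and the function name is hypothetical.

```python
import numpy as np

def token_shift(x, mu):
    """Blend each token with its predecessor along the sequence axis.

    x:  (seq_len, d_model) activations
    mu: (d_model,) learned per-dimension interpolation weights in [0, 1]
    """
    x_prev = np.roll(x, 1, axis=0)  # shift sequence right by one position
    x_prev[0] = 0.0                 # position 0 has no predecessor
    return mu * x + (1.0 - mu) * x_prev
```

Because the operation only touches the immediately preceding token, it is O(seq_len) per layer, versus O(seq_len^2) for full attention, which is where the claimed efficiency comes from.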
The architecture achieves a 3-seed mean val_bpb of 1.2252 with 17.0M parameters, int6 quantized and zlib compressed to ~15.86 MB. While this does not beat the current SOTA, I believe the token-shift approach is promising for its efficiency — particularly for inference, where the reduced attention overhead could significantly speed up decoding.
Beyond the hybrid architecture, the submission stacks several techniques from the leaderboard: SmearGate, bigram hash embeddings, value embeddings, XSA (cross-head suppression), partial RoPE (16/64 dims), LeakyReLU squared activation, Muon optimizer, EMA with late QAT, and logit softcapping. Full details and ablation notes are in the README.
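Of the stacked techniques, logit softcapping is simple enough to sketch. This is a generic illustration of the standard technique (a tanh squash that bounds logit magnitude), not the submission's implementation; the cap value shown is an arbitrary assumption.

```python
import numpy as np

def softcap(logits, cap=15.0):
    """Smoothly bound logits to (-cap, cap) while leaving small values nearly unchanged."""
    return cap * np.tanh(logits / cap)
```

Near zero, `tanh(z/cap) ≈ z/cap`, so small logits pass through almost unmodified, while extreme logits saturate at ±cap, which stabilizes training against outlier logits.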