11L VRL + Parallel Muon + Legal TTT v2 (val_bpb=1.1269, non-record) #1016
ADIITJ wants to merge 2 commits into openai:main from
Conversation
…ssion 3-seed results (1337, 42, 45): mean val_bpb=1.1264, artifact ~15.8MB. Forked from PR openai#549 (1.1194 SOTA). Adds VRL, BigramHash 3072, Tight SWA, zstd-22, sliding window eval fix. Drops Full GPTQ. TTT enabled by default.
Pull request overview
Adds a new 10min/16MB track record entry implementing VRL (value residual learning) on top of the PR #549 architecture, along with BigramHash(3072), zstd compression, sliding-window eval adjustments, and default-on legal score-first TTT.
Changes:
- Introduces a new `train_gpt.py` record script with VRL, Parallel Muon banking, int6+zstd export, sliding-window eval, and legal score-first SGD TTT.
- Adds record metadata (`submission.json`), a README, and 3 seed training logs for reproducibility.
Reviewed changes
Copilot reviewed 2 out of 6 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-27_VRL_ParallelMuon_LegalTTT_v2/train_gpt.py | New end-to-end training/eval/export script for this record (VRL + Parallel Muon + sliding-window eval + int6+zstd + legal TTT). |
| records/track_10min_16mb/2026-03-27_VRL_ParallelMuon_LegalTTT_v2/submission.json | Record metadata for leaderboard/track cataloging. |
| records/track_10min_16mb/2026-03-27_VRL_ParallelMuon_LegalTTT_v2/README.md | Human-readable summary, architecture, and reproduction steps. |
| records/track_10min_16mb/2026-03-27_VRL_ParallelMuon_LegalTTT_v2/logs/train_seed1337.log | Seed run log (reported metrics, bytes, timings). |
| records/track_10min_16mb/2026-03-27_VRL_ParallelMuon_LegalTTT_v2/logs/train_seed42.log | Seed run log (reported metrics, bytes, timings). |
| records/track_10min_16mb/2026-03-27_VRL_ParallelMuon_LegalTTT_v2/logs/train_seed45.log | Seed run log (reported metrics, bytes, timings). |
```python
if _COMPRESSOR == "zstd":
    quant_blob = zstandard.ZstdCompressor(level=22).compress(quant_raw)
else:
    quant_blob = lzma.compress(quant_raw, preset=6)
if master_process:
```
Compression fallback is inconsistent: when zstandard isn't available _COMPRESSOR is set to "zlib", but this branch actually uses lzma.compress/lzma.decompress. This misreports the compressor in logs/filenames and defeats the intended zlib fallback. Consider either (a) switching the fallback to zlib.compress/zlib.decompress (and tuning level), or (b) renaming _COMPRESSOR to "lzma" and dropping the unused zlib import.
```python
        self.value_residual = value_residual
        if value_residual:
            self.vr_lambda = nn.Parameter(torch.tensor([0.5, 0.5], dtype=torch.float32))

    def _xsa_efficient(self, y: Tensor, v: Tensor) -> Tensor:
        """Efficient XSA: subtract self-value projection via GQA-aware reshape (no repeat_interleave).
        y: [B, T, H, D], v: [B, T, Hkv, D]. H must be divisible by Hkv."""
        B, T, H, D = y.shape
        Hkv = v.size(-2)
        group = H // Hkv
        y_g = y.reshape(B, T, Hkv, group, D)       # [B, T, Hkv, group, D]
        vn = F.normalize(v, dim=-1).unsqueeze(-2)  # [B, T, Hkv, 1, D] -- broadcast ready
        proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn
        return (y_g - proj).reshape(B, T, H, D)

    def forward(self, x: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor, v_embed: Tensor | None = None, v0: Tensor | None = None) -> tuple[Tensor, Tensor | None]:
        bsz, seqlen, dim = x.shape
        q = F.linear(x, q_w.to(x.dtype)).reshape(bsz, seqlen, self.num_heads, self.head_dim)
        k = F.linear(x, k_w.to(x.dtype)).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim)
        v = F.linear(x, v_w.to(x.dtype))
        if v_embed is not None:
            v = v + v_embed
        v = v.reshape(bsz, seqlen, self.num_kv_heads, self.head_dim)
        raw_v = v if self.value_residual else None
        if self.value_residual and v0 is not None:
            lam = self.vr_lambda.to(dtype=v.dtype)
            v = lam[0] * v0 + lam[1] * v
        q = F.rms_norm(q, (q.size(-1),))
```
VRL is documented in the PR description/README as using “sigmoid gates”, but the implementation mixes v0 and v with unconstrained parameters (vr_lambda) and no sigmoid/normalization. This can lead to negative or >1 mixing coefficients and doesn’t match the stated method. Either apply a sigmoid/softmax to the gate parameters (or otherwise constrain them), or update the docs/metadata to reflect the actual parameterization.
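A minimal sketch of the first suggested fix, with names assumed from the snippet above: parameterize the mix with a single logit passed through a sigmoid, so the coefficients form a convex combination of `v0` and `v` (each in [0, 1], summing to 1), matching the "sigmoid gates" wording.

```python
import torch
import torch.nn as nn

class VRLGate(nn.Module):
    """Hypothetical constrained VRL gate: v_mixed = lam * v0 + (1 - lam) * v."""
    def __init__(self) -> None:
        super().__init__()
        # sigmoid(0) = 0.5, reproducing the original [0.5, 0.5] initialization
        self.vr_logit = nn.Parameter(torch.zeros(1))

    def forward(self, v0: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        lam = torch.sigmoid(self.vr_logit.to(v.dtype))
        return lam * v0 + (1.0 - lam) * v
```

A two-parameter softmax over `vr_lambda` would achieve the same constraint while keeping both weights learnable independently.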
```python
chunk_seqs = (chunk_end - chunk_start) // seq_len
if chunk_seqs > 0:
    cos_lr = args.ttt_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1)))
    for pg in optimizer.param_groups:
        pg['lr'] = cos_lr
    my_seq_s = (chunk_seqs * rank) // world_size
    my_seq_e = (chunk_seqs * (rank + 1)) // world_size
    my_chunk_seqs = my_seq_e - my_seq_s
    for _ep in range(args.ttt_epochs):
        for bs in range(0, my_chunk_seqs, args.ttt_batch_seqs):
            be = min(bs + args.ttt_batch_seqs, my_chunk_seqs)
            actual_bs = my_seq_s + bs
            start_tok = chunk_start + actual_bs * seq_len
            end_tok = chunk_start + (my_seq_s + be) * seq_len + 1
            if end_tok > val_tokens.numel():
                continue
            local = val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64)
            x = local[:-1].reshape(-1, seq_len)
            y = local[1:].reshape(-1, seq_len)
            optimizer.zero_grad(set_to_none=True)
            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                loss = base_model(x, y)
            loss.backward()
            if world_size > 1:
                for p in ttt_params:
                    if p.grad is not None:
                        dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
            torch.nn.utils.clip_grad_norm_(ttt_params, args.ttt_grad_clip)
```
eval_val_sliding_ttt can deadlock in distributed runs when chunk_seqs < world_size (e.g. if TTT_CHUNK_TOKENS is reduced): some ranks will have my_chunk_seqs == 0 and skip the training loop entirely, while other ranks enter the per-parameter dist.all_reduce(p.grad, ...) and block forever. Add a guard to ensure all ranks participate in the same collectives (e.g., skip TTT training for all ranks when chunk_seqs < world_size, or restructure to all-reduce a single flattened buffer and call it unconditionally each step).
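A minimal sketch of the restructured collective suggested above (helper name and the `ttt_params` list are assumptions): all gradients are flattened into one buffer and all-reduced unconditionally each step, so a rank with zero local sequences still contributes a zero gradient instead of skipping the collective and deadlocking its peers.

```python
import torch
import torch.distributed as dist

def allreduce_ttt_grads(ttt_params, world_size: int) -> None:
    """Hypothetical deadlock-safe gradient sync: one collective per step,
    executed identically on every rank regardless of local batch size."""
    if world_size <= 1:
        return
    # Ranks that ran no batches have p.grad is None; substitute zeros so the
    # averaged result is still correct across ranks.
    flat = torch.cat([
        (p.grad if p.grad is not None else torch.zeros_like(p)).reshape(-1)
        for p in ttt_params
    ])
    dist.all_reduce(flat, op=dist.ReduceOp.AVG)
    offset = 0
    for p in ttt_params:
        n = p.numel()
        if p.grad is None:
            p.grad = torch.zeros_like(p)
        p.grad.copy_(flat[offset:offset + n].view_as(p))
        offset += n
```

Note the zero-substitution slightly biases the average when some ranks are idle; weighting the reduce by per-rank sequence counts would remove that bias, at the cost of a second (also unconditional) collective.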
```python
        y_attn = flash_attn_3_func(q, k, v, causal=True)
        if self.attn.use_xsa:
            y_attn = self.attn._xsa_efficient(y_attn, v)
        y_flat = y_attn.reshape(bsz, seqlen, dim)
        # out_proj input is y_flat
        _accum(f"blocks.{block_idx}.attn.proj.weight", y_flat)
        attn_out = F.linear(y_flat, out_w.to(x_normed.dtype))
        raw_v = v if self.attn.value_residual else None
```
collect_hessians patches Block.forward and reimplements attention, but it omits the gated_attention path (the attn_gate(x) sigmoid multiply) that exists in CausalSelfAttention.forward. If GATED_ATTENTION=1 is used with FULL_GPTQ=1, the collected Hessians won’t correspond to the actual forward pass. Consider mirroring the gated-attention logic in the patched forward (or explicitly disallow FULL_GPTQ when gated attention is enabled).
```
| | Seed | Steps | Pre-quant bpb | Post-quant bpb | Post-TTT bpb | Artifact bytes | Valid | |
| |------|-------|---------------|----------------|--------------|----------------|-------| |
| | 1337 | 6170 | 1.1443 | 1.1524 | 1.1268 | 15,828,109 | Yes | |
| | 42 | 6177 | 1.1428 | 1.1506 | 1.1253 | 15,828,109 | Yes | |
| | 45 | 6182 | 1.1435 | 1.1528 | 1.1270 | 15,813,731 | Yes | |
```
The markdown table is malformed: the header and rows start with || which creates an extra empty column and renders incorrectly in many markdown viewers. Use a single leading | for each row (and ensure the separator row has matching column count).
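For reference, a well-formed version of the same table (data values taken verbatim from the quoted rows), with exactly one leading and trailing pipe per row and a matching separator:

```markdown
| Seed | Steps | Pre-quant bpb | Post-quant bpb | Post-TTT bpb | Artifact bytes | Valid |
|------|-------|---------------|----------------|--------------|----------------|-------|
| 1337 | 6170  | 1.1443        | 1.1524         | 1.1268       | 15,828,109     | Yes   |
| 42   | 6177  | 1.1428        | 1.1506         | 1.1253       | 15,828,109     | Yes   |
| 45   | 6182  | 1.1435        | 1.1528         | 1.1270       | 15,813,731     | Yes   |
```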
```
| | 1337 | 6170 | 1.1443 | 1.1524 | 1.1268 | 15,828,109 | Yes | |
| | 42 | 6177 | 1.1428 | 1.1506 | 1.1253 | 15,828,109 | Yes | |
| | 45 | 6182 | 1.1435 | 1.1528 | 1.1270 | 15,813,731 | Yes | |
```
The 3-seed table claims seed 42 has Artifact bytes 15,828,109 (under the 16MB cap), but logs/train_seed42.log reports Serialized model int6+zstd: 16,543,449 bytes and Total submission size ...: 16,645,615 bytes (over the cap). Please reconcile which value is correct and ensure the seed-42 artifact is actually < 16,000,000 bytes if this is intended as a 3-seed validated submission.
```json
  "val_bpb_seeds": [1.12684, 1.12527, 1.12704],
  "seeds": [1337, 42, 45],
  "status": "pending_validation",
  "blurb": "Forked from PR #549 (abaybektursun, 1.1194 SOTA). Additions: (1) VRL (Value Residual Learning, arxiv:2410.17897) on all 11 layers via learned sigmoid gates; (2) BigramHash 1536->3072 (free -0.0009 bpb); (3) Tight SWA preferred over EMA; (4) zstd-22 compression replacing lzma; (5) Sliding window eval bug fix. Removed Full GPTQ (added overhead without meaningful compression gain). TTT enabled by default with freeze_blocks=0. 6170-6183 steps at ~97ms/step.",
  "architecture": "11L_512dim_8h_4kv_XSA4_PartialRoPE16_LNScale_VE128_VRL_LeakyReLU_ParallelMuon",
  "base_submission": "PR #549 (abaybektursun, 1.1194)",
  "bytes_total": 15828109,
```
submission.json doesn’t follow the common metadata schema used by other track entries (e.g., it omits val_loss and bytes_code, and introduces nonstandard fields like status/val_bpb_seeds). If there’s downstream tooling that aggregates submissions, this can break parsing/comparisons. Suggest aligning with the established pattern in this repo’s other records/track_10min_16mb/*/submission.json files (e.g. include val_loss, bytes_code, optional per-seed breakdown under seed_results, and keep date format consistent).
Suggested change:

```diff
-  "val_bpb_seeds": [1.12684, 1.12527, 1.12704],
+  "val_loss": 1.12639,
   "seeds": [1337, 42, 45],
-  "status": "pending_validation",
+  "seed_results": [
+    { "seed": 1337, "val_bpb": 1.12684 },
+    { "seed": 42, "val_bpb": 1.12527 },
+    { "seed": 45, "val_bpb": 1.12704 }
+  ],
   "blurb": "Forked from PR #549 (abaybektursun, 1.1194 SOTA). Additions: (1) VRL (Value Residual Learning, arxiv:2410.17897) on all 11 layers via learned sigmoid gates; (2) BigramHash 1536->3072 (free -0.0009 bpb); (3) Tight SWA preferred over EMA; (4) zstd-22 compression replacing lzma; (5) Sliding window eval bug fix. Removed Full GPTQ (added overhead without meaningful compression gain). TTT enabled by default with freeze_blocks=0. 6170-6183 steps at ~97ms/step.",
   "architecture": "11L_512dim_8h_4kv_XSA4_PartialRoPE16_LNScale_VE128_VRL_LeakyReLU_ParallelMuon",
   "base_submission": "PR #549 (abaybektursun, 1.1194)",
   "bytes_total": 15828109,
+  "bytes_code": 0,
```
Seed 42 artifact was 16.6MB (over 16MB limit), not 15.8MB as reported. Valid seeds reduced to 1337 and 45 only. Updated to non-record submission.
Community Review — 11L VRL + Parallel Muon + Legal TTT v2 (val_bpb=1.1269, non-record)

BPB: 1.1269 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA

The TTT path at line 1109 implements the score-first-per-chunk pattern: each chunk is scored under

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=102166 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based
Summary
Changes from PR #549 (SOTA 1.1194)
3-Seed Results
Test plan
`torchrun --standalone --nproc_per_node=8 train_gpt.py` on 8×H100