llama.cpp fork with TurboQuant WHT-rotated KV cache & weight compression + Gemma 4 MTP speculative decoding for ~30-50% throughput gains (C++, updated May 8, 2026)
DMax: Aggressive Parallel Decoding for dLLMs
Fused TBQ4 Flash Attention + MTP + Shared Tensors for llama.cpp — 82+ tok/s with lossless 4.25 bpv KV cache at 200K context on RTX 4090
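As a rough illustration of why 4.25 bits per value (bpv) matters at 200K context, the sketch below compares KV-cache footprints against an FP16 baseline. The layer count, KV-head count, and head dimension are placeholder assumptions for illustration, not the actual model configuration from the repo.

```python
# Rough KV-cache footprint comparison: FP16 vs a 4.25-bpv quantized cache.
# Layer count, KV heads, and head dim below are illustrative assumptions.
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bits_per_value):
    # K and V each store context_len * n_kv_heads * head_dim values per layer.
    values = 2 * context_len * n_layers * n_kv_heads * head_dim
    return values * bits_per_value / 8

ctx = 200_000
fp16 = kv_cache_bytes(ctx, n_layers=32, n_kv_heads=8, head_dim=128, bits_per_value=16)
tbq4 = kv_cache_bytes(ctx, n_layers=32, n_kv_heads=8, head_dim=128, bits_per_value=4.25)

print(f"FP16:    {fp16 / 2**30:.1f} GiB")
print(f"4.25bpv: {tbq4 / 2**30:.1f} GiB")
```

Under these assumed dimensions the quantized cache is 16/4.25 ≈ 3.8× smaller, which is what makes 200K-token contexts fit on a single 24 GB card.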
Curated collection of research on the limitations of next-token prediction and methods that go beyond it.
Optimized vLLM setup for Qwen3.6-27B-FP8 on dual RTX PRO 6000 Blackwell (192 GB GDDR7, no NVLink); config, benchmark sweep results, and a custom chat template with thinking mode off by default.
ChemMiniQ3-SAbRLo is a lightweight experimental generative model for chemistry, built on a mini Qwen2-like architecture, designed for rapid prototyping with HuggingFace AutoModel/AutoTokenizer compatibility and for fast iteration on Multi-Token Prediction (MTP) and RL fine-tuning algorithms/rewards.
Multi-Token Prediction benchmarks for Gemma 4 on Apple Silicon — LiteRT-LM, transformers, and llama.cpp at batch=1 on a MacBook M4 Pro. ~2× speedup reproducible in one specific runtime.
A lightweight experimental generative model for chemistry, with a mini Qwen2-like architecture, horizon loss, and biologically-aware RL fine-tuning on SELFIES molecular representations.
Optimized vLLM setup for Gemma 4 31B NVFP4 with MTP on dual RTX PRO 6000 Blackwell using vLLM and Docker: native FP4 Tensor Cores, Multi-Token Prediction (96.5% acceptance rate), and prefix caching. Includes benchmark results and replication scripts.
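To give a feel for how an acceptance rate like 96.5% converts into throughput, the sketch below uses the standard independent-acceptance approximation for speculative/MTP decoding: with k drafted tokens and per-token acceptance probability alpha, the expected number of tokens emitted per verification step is (1 - alpha^(k+1)) / (1 - alpha). This is a back-of-the-envelope model, not how vLLM itself accounts for throughput.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verification step with k drafted tokens
    and per-token acceptance probability alpha, assuming acceptances are
    independent (so the accepted prefix length is geometrically truncated).

    E = sum_{i=0}^{k} alpha**i = (1 - alpha**(k+1)) / (1 - alpha)
    (the i=0 term is the one token the target model always contributes).
    """
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for k in (1, 2, 4):
    print(f"k={k}: {expected_tokens_per_step(0.965, k):.3f} tokens/step")
```

At alpha = 0.965 even a single draft token nearly doubles tokens per step, which is consistent with the throughput gains this kind of setup reports.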
Research code for ProbeRoute, a probe-initialized sparse routing method for frozen-backbone multi-token prediction
Reverse-engineering how DeepSeek achieved frontier LLM performance at a fraction of the cost — through hands-on PyTorch implementations of MLA, MoE, MTP, RoPE, and quantization.
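For context on what an MTP training objective looks like in code, here is a minimal NumPy sketch of a multi-horizon cross-entropy: each of n_future prediction heads at position t is scored against token t+h. It is an illustrative toy, not the DeepSeek implementation from the repo, and the head layout is an assumption.

```python
import numpy as np

def mtp_loss(logits, tokens, n_future=2):
    """Average cross-entropy over the next n_future tokens at each position.

    logits: (seq_len, n_future, vocab) -- one prediction head per horizon.
    tokens: (seq_len,) -- token ids of the sequence.
    Positions whose horizon-h target runs past the sequence end are skipped.
    """
    seq_len, heads, vocab = logits.shape
    assert heads == n_future
    losses = []
    for h in range(1, n_future + 1):
        valid = seq_len - h               # head h-1 at position t predicts t+h
        lg = logits[:valid, h - 1]        # (valid, vocab)
        tgt = tokens[h:h + valid]         # (valid,)
        lg = lg - lg.max(axis=1, keepdims=True)          # stable log-softmax
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        losses.append(-logp[np.arange(valid), tgt].mean())
    return float(np.mean(losses))

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 2, 16))
tokens = rng.integers(0, 16, size=8)
print(mtp_loss(logits, tokens))
```

With all-zero logits the per-token loss is exactly log(vocab), a handy sanity check when wiring up a head like this.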