llama.cpp fork with TurboQuant WHT-rotated KV cache & weight compression + Gemma 4 MTP speculative decoding for ~30-50% throughput gains (C++, updated May 8, 2026)
DMax: Aggressive Parallel Decoding for dLLMs
Fused TBQ4 Flash Attention + MTP + Shared Tensors for llama.cpp — 82+ tok/s with lossless 4.25 bpv KV cache at 200K context on RTX 4090
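As a rough illustration of why 4.25 bits per value (bpv) matters at 200K context, the sketch below compares KV-cache footprints against an FP16 baseline. The layer count, KV-head count, and head dimension are placeholder assumptions for illustration, not the actual model configuration from the repo.

```python
# Rough KV-cache footprint comparison: FP16 vs a 4.25-bpv quantized cache.
# Layer count, KV heads, and head dim below are illustrative assumptions.
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bits_per_value):
    # K and V each store context_len * n_kv_heads * head_dim values per layer.
    values = 2 * context_len * n_layers * n_kv_heads * head_dim
    return values * bits_per_value / 8

ctx = 200_000
fp16 = kv_cache_bytes(ctx, n_layers=32, n_kv_heads=8, head_dim=128, bits_per_value=16)
tbq4 = kv_cache_bytes(ctx, n_layers=32, n_kv_heads=8, head_dim=128, bits_per_value=4.25)

print(f"FP16:    {fp16 / 2**30:.1f} GiB")
print(f"4.25bpv: {tbq4 / 2**30:.1f} GiB")
```

Under these assumed dimensions the quantized cache is 16/4.25 ≈ 3.8× smaller, which is what makes 200K-token contexts fit on a single 24 GB card.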
Curated collection of research on the limitations of next-token prediction and methods that go beyond it.
Optimized vLLM setup for Qwen3.6-27B-FP8 on dual RTX PRO 6000 Blackwell (192 GB GDDR7, no NVLink); config, benchmark sweep results, and a custom chat template with thinking mode off by default.
ChemMiniQ3-SAbRLo is a lightweight experimental generative model for chemistry, built on a mini Qwen2-like architecture, designed for rapid prototyping with HuggingFace AutoModel/AutoTokenizer compatibility and for fast iteration on Multi-Token Prediction (MTP) and RL fine-tuning algorithms/rewards.
Multi-Token Prediction benchmarks for Gemma 4 on Apple Silicon — LiteRT-LM, transformers, and llama.cpp at batch=1 on a MacBook M4 Pro. ~2× speedup reproducible in one specific runtime.
A lightweight experimental generative model for chemistry, with a mini Qwen2-like architecture, horizon loss, and biologically-aware RL fine-tuning on SELFIES molecular representations.
Optimized vLLM setup for Gemma 4 31B NVFP4 with MTP on dual RTX PRO 6000 Blackwell using vLLM and Docker: native FP4 Tensor Cores, Multi-Token Prediction (96.5% acceptance rate), and prefix caching. Includes benchmark results and replication scripts.
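To give a feel for how an acceptance rate like 96.5% converts into throughput, the sketch below uses the standard independent-acceptance approximation for speculative/MTP decoding: with k drafted tokens and per-token acceptance probability alpha, the expected number of tokens emitted per verification step is (1 - alpha^(k+1)) / (1 - alpha). This is a back-of-the-envelope model, not how vLLM itself accounts for throughput.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verification step with k drafted tokens
    and per-token acceptance probability alpha, assuming acceptances are
    independent (so the accepted prefix length is geometrically truncated).

    E = sum_{i=0}^{k} alpha**i = (1 - alpha**(k+1)) / (1 - alpha)
    (the i=0 term is the one token the target model always contributes).
    """
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for k in (1, 2, 4):
    print(f"k={k}: {expected_tokens_per_step(0.965, k):.3f} tokens/step")
```

At alpha = 0.965 even a single draft token nearly doubles tokens per step, which is consistent with the throughput gains this kind of setup reports.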
Research code for ProbeRoute, a probe-initialized sparse routing method for frozen-backbone multi-token prediction
Reverse-engineering how DeepSeek achieved frontier LLM performance at a fraction of the cost — through hands-on PyTorch implementations of MLA, MoE, MTP, RoPE, and quantization.
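For context on what an MTP training objective looks like in code, here is a minimal NumPy sketch of a multi-horizon cross-entropy: each of n_future prediction heads at position t is scored against token t+h. It is an illustrative toy, not the DeepSeek implementation from the repo, and the head layout is an assumption.

```python
import numpy as np

def mtp_loss(logits, tokens, n_future=2):
    """Average cross-entropy over the next n_future tokens at each position.

    logits: (seq_len, n_future, vocab) -- one prediction head per horizon.
    tokens: (seq_len,) -- token ids of the sequence.
    Positions whose horizon-h target runs past the sequence end are skipped.
    """
    seq_len, heads, vocab = logits.shape
    assert heads == n_future
    losses = []
    for h in range(1, n_future + 1):
        valid = seq_len - h               # head h-1 at position t predicts t+h
        lg = logits[:valid, h - 1]        # (valid, vocab)
        tgt = tokens[h:h + valid]         # (valid,)
        lg = lg - lg.max(axis=1, keepdims=True)          # stable log-softmax
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        losses.append(-logp[np.arange(valid), tgt].mean())
    return float(np.mean(losses))

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 2, 16))
tokens = rng.integers(0, 16, size=8)
print(mtp_loss(logits, tokens))
```

With all-zero logits the per-token loss is exactly log(vocab), a handy sanity check when wiring up a head like this.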