Releases · quantumaikr/quant.cpp
v0.5.0 — Gemma 4 MoE + 7x KV Compression + WASM
What's New
Gemma 4 26B-A4B MoE Support
Full support for Gemma 4's hybrid MoE architecture: 128 experts, dual-FFN, hybrid attention (sliding + full), QK-norm, learned RoPE, GeGLU activation. Generates correct answers in English and Korean.
7x KV Cache Compression
Same hardware, 7x longer context, zero quality loss.
| Model | FP16 KV | quant.cpp KV | Gain |
|---|---|---|---|
| Llama 3.2 3B (16GB Mac) | 50K tokens | 350K tokens | 6.9x |
| Gemma 4 26B (16GB Mac) | 4K tokens | 30K tokens | 6.9x |
New Models
- Llama 3.2 3B Instruct — 17 tok/s, correct code generation
- Gemma 4 26B-A4B-it — 3.9 tok/s, 128-expert MoE
WASM Browser Demo
192KB binary. Drag and drop a GGUF model, chat in the browser. Everything runs client-side.
→ Try it
Windows (MSVC) Support
Compiles with Visual Studio 2019/2022 via a pthread shim and a C11 atomics compatibility layer.
quant.h Synced
The single header now includes Gemma 4, Llama 3, and IQ3_XXS support. `cc app.c -lm -lpthread` — done.
Documentation
- API Reference — 730 lines, all platforms
- Custom Quantization Guide — add your own KV type in 3 functions
- ROADMAP — project direction
Performance
- Gemma 4 26B: 549ms → 257ms/token (-53%)
- Metal GPU: 7 compute kernels implemented (infrastructure for batch inference)
Bug Fixes
- Gemma 4 NaN regression, Llama head_dim misdetection
- TQ_STATIC_ASSERT in C mode, stack buffer overflow
- Zero build warnings, 34/34 tests pass, score 99.2%
Full changelog: CHANGELOG.md
Full Changelog: v0.2.0...v0.5.0
v0.2.0
Full Changelog: v0.1.0...v0.2.0
v0.1.0 — Multi-Architecture Engine with KV Cache Compression
TurboQuant.cpp v0.1.0 — First Release
Multi-architecture LLM inference engine in pure C with KV cache compression.
Highlights
- 3 models supported: Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
- 3.8x KV cache compression — at 32K context: 1.2 GB vs llama.cpp's 4.4 GB
- llama.cpp parity: 51 tok/s single-thread (vs 50.7 tok/s)
- Multi-shard safetensors: loads sharded models (Gemma 4B = 2 shards)
- Dual tokenizer: GPT2 byte-level BPE + SentencePiece auto-detect
- TQM format: pre-quantized mmap binary, instant loading
- Zero dependencies: libc only, ~1MB binary
Supported Models
| Model | Speed (Q4, 6T) | Quality |
|---|---|---|
| Gemma 3 4B | 5.2 tok/s | "capital of France" → "Paris" |
| Qwen3.5-0.8B | 82 tok/s | 0.999 cosine vs PyTorch |
| Gemma 3 270M | 176 tok/s | per-layer exact match |
KV Cache Memory Savings
Gemma 3 4B at 32K context:
llama.cpp (FP16 KV): 4,352 MB
TurboQuant (Q4 KV): 1,156 MB ← 3.8x compression
Quick Start
```sh
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
bash scripts/quickstart.sh "What is deep learning?"
```
What's Inside
- 9,000+ lines of pure C — complete inference engine
- 8 quantization types: Uniform, Mixed, PolarQuant, QJL, TurboQuant
- Architecture dispatch: Qwen3.5 (DeltaNet + Attention) + Gemma 3 (Sliding Window + GQA)
- Q4 weight quantization with NEON 2-row batching + thread pool
- Integer Q4×Q8 attention via ARM vdotq_s32
- 20 test suites, 70+ tests
- Python bindings (ctypes), llama.cpp/vLLM integration stubs
References
- TurboQuant (ICLR 2026) — KV cache compression
- QJL (AAAI 2025) — 1-bit quantized JL transform
- PolarQuant (AISTATS 2026) — Polar coordinate quantization