
Releases: quantumaikr/quant.cpp

v0.5.0 — Gemma 4 MoE + 7x KV Compression + WASM

05 Apr 06:36


What's New

Gemma 4 26B-A4B MoE Support

Full support for Gemma 4's hybrid MoE architecture: 128 experts, dual-FFN, hybrid attention (sliding + full), QK-norm, learned RoPE, GeGLU activation. Generates correct answers in English and Korean.

7x KV Cache Compression

Same hardware, up to 7x longer context, with no measured quality loss.

Model                     FP16 KV      quant.cpp KV   Gain
Llama 3.2 3B (16GB Mac)   50K tokens   350K tokens    6.9x
Gemma 4 26B (16GB Mac)    4K tokens    30K tokens     6.9x

New Models

  • Llama 3.2 3B Instruct — 17 tok/s, correct code generation
  • Gemma 4 26B-A4B-it — 3.9 tok/s, 128-expert MoE

WASM Browser Demo

192KB binary. Drag and drop a GGUF model, chat in the browser. Everything runs client-side.

Windows (MSVC) Support

Builds with Visual Studio 2019 and 2022, using a pthread shim and a C11 atomics compatibility layer.

quant.h Synced

The single header now includes Gemma 4, Llama 3, and IQ3_XXS support. Build with cc app.c -lm -lpthread and you're done.

Documentation

Performance

  • Gemma 4 26B: 549ms → 257ms/token (-53%)
  • Metal GPU: 7 compute kernels implemented (infrastructure for batch inference)

Bug Fixes

  • Gemma 4 NaN regression, Llama head_dim misdetection
  • TQ_STATIC_ASSERT in C mode, stack buffer overflow
  • Zero build warnings, 34/34 tests pass, score 99.2%

Full changelog: CHANGELOG.md

Full Changelog: v0.2.0...v0.5.0

v0.2.0

03 Apr 16:56


Full Changelog: v0.1.0...v0.2.0

v0.1.0 — Multi-Architecture Engine with KV Cache Compression

31 Mar 14:31


TurboQuant.cpp v0.1.0 — First Release

Multi-architecture LLM inference engine in pure C with KV cache compression.

Highlights

  • 3 models supported: Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
  • 3.8x KV cache compression — at 32K context: 1.2 GB vs llama.cpp's 4.4 GB
  • llama.cpp parity: 51 tok/s single-thread (vs 50.7 tok/s)
  • Multi-shard safetensors: loads sharded models (Gemma 4B = 2 shards)
  • Dual tokenizer: GPT2 byte-level BPE + SentencePiece auto-detect
  • TQM format: pre-quantized mmap binary, instant loading
  • Zero dependencies: libc only, ~1MB binary

Supported Models

Model          Speed (Q4, 6 threads)   Quality check
Gemma 3 4B     5.2 tok/s               "capital of France" → "Paris"
Qwen3.5-0.8B   82 tok/s                0.999 cosine similarity vs PyTorch
Gemma 3 270M   176 tok/s               per-layer exact match

KV Cache Memory Savings

Gemma 3 4B at 32K context:
  llama.cpp (FP16 KV):    4,352 MB
  TurboQuant (Q4 KV):     1,156 MB  ← 3.8x compression

Quick Start

git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
bash scripts/quickstart.sh "What is deep learning?"

What's Inside

  • 9,000+ lines of pure C — complete inference engine
  • 8 quantization types: Uniform, Mixed, PolarQuant, QJL, TurboQuant
  • Architecture dispatch: Qwen3.5 (DeltaNet + Attention) + Gemma 3 (Sliding Window + GQA)
  • Q4 weight quantization with NEON 2-row batching + thread pool
  • Integer Q4×Q8 attention via ARM vdotq_s32
  • 20 test suites, 70+ tests
  • Python bindings (ctypes), llama.cpp/vLLM integration stubs

References

  • TurboQuant (ICLR 2026) — KV cache compression
  • QJL (AAAI 2025) — 1-bit quantized JL transform
  • PolarQuant (AISTATS 2026) — Polar coordinate quantization