Releases · quantumaikr/quant.cpp
v0.5.0 — Gemma 4 MoE + 7x KV Compression + WASM
What's New
Gemma 4 26B-A4B MoE Support
Full support for Gemma 4's hybrid MoE architecture: 128 experts, dual-FFN, hybrid attention (sliding + full), QK-norm, learned RoPE, GeGLU activation. Generates correct answers in English and Korean.
7x KV Cache Compression
Same hardware, 7x longer context, zero quality loss.
| Model | FP16 KV | quant.cpp KV | Gain |
|---|---|---|---|
| Llama 3.2 3B (16GB Mac) | 50K tokens | 350K tokens | 6.9x |
| Gemma 4 26B (16GB Mac) | 4K tokens | 30K tokens | 6.9x |
New Models
- Llama 3.2 3B Instruct — 17 tok/s, correct code generation
- Gemma 4 26B-A4B-it — 3.9 tok/s, 128-expert MoE
WASM Browser Demo
192KB binary. Drag and drop a GGUF model, chat in the browser. Everything runs client-side.
→ Try it
Windows (MSVC) Support
Compiles with Visual Studio 2019/2022 via a pthread shim and a C11 atomics compatibility layer.
quant.h Synced
The single header now includes Gemma 4, Llama 3, and IQ3_XXS support. `cc app.c -lm -lpthread` — done.
Documentation
- API Reference — 730 lines, all platforms
- Custom Quantization Guide — add your own KV type in 3 functions
- ROADMAP — project direction
Performance
- Gemma 4 26B: 549ms → 257ms/token (-53%)
- Metal GPU: 7 compute kernels implemented (infrastructure for batch inference)
Bug Fixes
- Gemma 4 NaN regression, Llama head_dim misdetection
- TQ_STATIC_ASSERT in C mode, stack buffer overflow
- Zero build warnings, 34/34 tests pass, score 99.2%
Full changelog: CHANGELOG.md
Full Changelog: v0.2.0...v0.5.0
v0.2.0
Full Changelog: v0.1.0...v0.2.0
v0.1.0 — Multi-Architecture Engine with KV Cache Compression
TurboQuant.cpp v0.1.0 — First Release
Multi-architecture LLM inference engine in pure C with KV cache compression.
Highlights
- 3 models supported: Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
- 3.8x KV cache compression — at 32K context: 1.2 GB vs llama.cpp's 4.4 GB
- llama.cpp parity: 51 tok/s single-thread (vs 50.7 tok/s)
- Multi-shard safetensors: loads sharded models (Gemma 4B = 2 shards)
- Dual tokenizer: GPT2 byte-level BPE + SentencePiece auto-detect
- TQM format: pre-quantized mmap binary, instant loading
- Zero dependencies: libc only, ~1MB binary
Supported Models
| Model | Speed (Q4, 6T) | Quality |
|---|---|---|
| Gemma 3 4B | 5.2 tok/s | "capital of France" → "Paris" |
| Qwen3.5-0.8B | 82 tok/s | 0.999 cosine vs PyTorch |
| Gemma 3 270M | 176 tok/s | per-layer exact match |
KV Cache Memory Savings
Gemma 3 4B at 32K context:
llama.cpp (FP16 KV): 4,352 MB
TurboQuant (Q4 KV): 1,156 MB ← 3.8x compression
Quick Start
```sh
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
bash scripts/quickstart.sh "What is deep learning?"
```
What's Inside
- 9,000+ lines of pure C — complete inference engine
- 8 quantization types: Uniform, Mixed, PolarQuant, QJL, TurboQuant
- Architecture dispatch: Qwen3.5 (DeltaNet + Attention) + Gemma 3 (Sliding Window + GQA)
- Q4 weight quantization with NEON 2-row batching + thread pool
- Integer Q4×Q8 attention via ARM vdotq_s32
- 20 test suites, 70+ tests
- Python bindings (ctypes), llama.cpp/vLLM integration stubs
References
- TurboQuant (ICLR 2026) — KV cache compression
- QJL (AAAI 2025) — 1-bit quantized JL transform
- PolarQuant (AISTATS 2026) — Polar coordinate quantization