quant.cpp/README.md at main · synapticore-io/quant.cpp

LLM inference with 7x longer context — pure C, zero dependencies

Lossless KV cache compression. Also ships as quant.h — a single-header library.
72K LOC. Embeddable. Read it in an afternoon.

The Problem

LLM memory is dominated by the KV cache, not model weights. At 32K context, a 8B model's KV cache consumes 4GB — more than the model itself. Every existing engine stores KV in FP16. We compress it.

┌─────────────────────────────────────────────────────────────┐
│                    16GB Mac Memory                          │
├──────────────┬──────────────────────────────────────────────┤
│  Model (4GB) │  KV Cache (FP16)                             │
│              │  ████████████████████████  8K context  ← OOM │
├──────────────┼──────────────────────────────────────────────┤
│  Model (4GB) │  KV (4-bit) ███  →→→→→  350K context         │
│              │              ↑ 6.9x smaller                  │
└──────────────┴──────────────────────────────────────────────┘

The Result

Same hardware. 7x longer context. Zero quality loss.

Hardware	Model	FP16 KV	quant.cpp KV	Gain
16GB Mac	Llama 3.2 3B	50K tokens	350K tokens	6.9x
16GB Mac	Gemma 4 26B MoE	4K tokens	30K tokens	6.9x
8GB Laptop	Llama 8B (Q4)	16K tokens	61K tokens	3.8x
24GB RTX 3090	Llama 8B (Q4)	147K tokens	559K tokens	3.8x

# One command. That's it.
./quant model.gguf -p "hello" -k uniform_4b -v q4

How It Compares

vs llama.cpp: Quality at same bit budget

                    KV Quantization Quality (SmolLM2 1.7B, WikiText-2)
                    
  llama.cpp Q4_0 KV │██████████████████████████████████████ PPL +10.6%  ← broken
                    │
   quant.cpp 4-bit  │▏ PPL +0.0%  ← lossless
                    │
   quant.cpp 3-bit  │█ PPL +1.3%  ← delta compression
                    └────────────────────────────────────────────────
                     0%                                         +12%
                              Perplexity Degradation →

vs every other engine

	quant.cpp	llama.cpp	vLLM	MLX	ONNX RT
KV compression	7x, +0% PPL	+10.6% PPL	--	--	--
Code size	72K LOC	250K+	100K+	50K+	500K+
Dependencies	zero	ggml	PyTorch	Apple fw	runtime
Embeddable	single header	--	--	--	complex
WASM	192KB	--	--	--	--
GPU serving	basic	full	best	Metal	multi

Use llama.cpp when you need speed. Use vLLM when you need throughput. Use quant.cpp when you need to fit more context in less memory — or embed LLM in your own app.

Supported Models

Model	Params	Architecture	Speed (M1 Pro, 8T)	KV Compression
SmolLM2 135M	135M	Llama	103 tok/s	2.4x
Llama 3.2 3B Instruct	3B	Llama 3 (GQA)	10 tok/s	6.9x
Gemma 4 26B-A4B-it	26B (4B active)	MoE 128 experts	3.9 tok/s	3.5x
Qwen3.5 0.8B	752M	DeltaNet hybrid	80 tok/s	3.8x
Qwen3.5 4B	4B	DeltaNet hybrid	20 tok/s	3.8x
SmolLM2 1.7B	1.7B	Llama	25 tok/s	3.8x
Gemma 3 270M	270M	Gemma 3	176 tok/s	3.8x

GGUF format. Load any llama.cpp-compatible model.

Gemma 4 26B-A4B architecture details

Full support for Gemma 4's hybrid MoE architecture:

Dual-FFN: parallel Dense MLP + 128-expert MoE per layer
Hybrid attention: 25 sliding (head_dim=256) + 5 full (head_dim=512) layers
QK-norm aware KV compression: auto FP32 keys + Q4 values (3.5x savings)
Learned RoPE with per-layer frequency factors
IQ3_XXS/IQ4_NL fused dot with NEON optimization for MoE experts
GeGLU activation (NEON-accelerated fast tanh approximation)

./build/quant gemma-4-26B-A4B-it-UD-Q3_K_M.gguf \
  -p "<start_of_turn>user\nWhat is the capital of France?\n<end_of_turn>\n<start_of_turn>model\n" \
  -n 50 -j 8 -T 0.0 -k uniform_4b -v q4
# Output: "The capital of France is **Paris**."

KV Cache Compression

The Idea

Standard:  Store every key as-is            → 16 bits/element → FP16

quant.cpp: Quantize keys to 4-bit           → 4 bits/element  → 3.8x
           + quantize values to Q4           → 4 bits/element  → 6.9x
           + delta encode adjacent keys      → 3 bits/element  → 8.5x

Like video compression: I-frames (FP32) every 64 tokens, P-frames (3-bit delta) between.

Quality vs Compression

                    WikiText-2 PPL (SmolLM2 1.7B)

  FP32 baseline      14.63 │ ●
  4b K + FP16 V       14.63 │ ● identical
  4b K + Q4 V         14.57 │ ● slightly better (!)
  delta 3b K + Q4 V   14.82 │  ●  +1.3%
  llama.cpp Q4 KV     16.18 │          ● +10.6%
  3b K (no delta)       ——  │                              ● +62%
                            └──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──
                              14  15  16  17  18  19  20  21+

Modes

Config	Compression	PPL vs FP32	Best for
`delta + 3b K + Q4 V`	~8.5x	+1.3%	Maximum context
`delta + 4b K + Q4 V`	~6.9x	~0%	Quality + compression
`uniform_4b K + Q4 V`	6.9x	~0%	Simple, no delta overhead
`uniform_4b K + FP16 V`	1.6x	+0.0%	Lossless baseline

QK-norm Aware (Gemma 4)

Models with QK-norm normalize keys to the unit sphere, creating extremely sparse distributions. quant.cpp auto-detects this and stores keys in FP32 while quantizing only values — preserving perfect precision with 3.5x V memory reduction.

Quick Start

git clone https://github.com/quantumaikr/quant.cpp && cd quant.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Basic inference
./build/quant model.gguf -p "hello"

# With KV compression (recommended)
./build/quant model.gguf -p "hello" -k uniform_4b -v q4

# Delta compression (maximum context)
./build/quant model.gguf -p "hello" -k uniform_3b -v q4 --delta

# Perplexity benchmark
./build/quant model.gguf --ppl input.txt -k uniform_4b -v q4

Single-Header Mode

Copy one file. Add LLM to any C project.

#define QUANT_IMPLEMENTATION
#include "quant.h"

int main() {
    quant_model* m = quant_load("model.gguf");
    quant_ctx*   c = quant_new(m, NULL);
    
    // Streaming
    quant_generate(c, "Tell me a joke", print_token, NULL);
    
    // Or one-shot
    char* answer = quant_ask(c, "What is 2+2?");
    printf("%s\n", answer);
    free(answer);
    
    quant_free_ctx(c);
    quant_free_model(m);
}

cc app.c -o app -lm -lpthread    # that's it — no cmake, no framework

15.7K LOC, 643KB, ~2s compile time. Full API:

Function	Description
`quant_load(path)`	Load a GGUF model
`quant_new(model, config)`	Create inference context
`quant_generate(ctx, prompt, cb, ud)`	Stream tokens via callback
`quant_ask(ctx, prompt)`	Generate and return string
`quant_free_ctx(ctx)`	Free context
`quant_free_model(model)`	Free model

Browser Demo (WASM)

192KB. The entire inference engine compiles to a WASM binary smaller than most JPEGs.

cd wasm && bash build.sh          # Requires: emscripten
python3 -m http.server 8080       # Serve locally
# Open http://localhost:8080, drag & drop any GGUF model

Everything runs client-side. Nothing is uploaded. KV compression active by default.

Docker & Server

Docker (zero-dependency, ~10MB image):

docker build -t quant.cpp .
docker run -v ./models:/models quant.cpp /models/model.gguf -p "hello" -k uniform_4b -v q4

OpenAI-compatible server (/v1/chat/completions):

cmake -B build -DTQ_BUILD_SERVER=ON && cmake --build build
./build/quant-server model.gguf -p 8080 -k uniform_4b

# Works with the OpenAI Python SDK
curl http://localhost:8080/v1/chat/completions \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'

Build with -DTQ_BUILD_SERVER=ON. Streaming SSE supported. KV compression configurable per request.

Backends & Performance

Backend	Platform	Status	Notes
NEON	ARM (Apple Silicon)	Production	5.8x SIMD speedup
AVX2	x86	Production
Metal	Apple GPU	Verified	Batch matmul dispatch
CUDA	NVIDIA GPU	Compiles
Vulkan	Cross-platform	Compiles
WASM	Browser	NEW	192KB binary
MSVC	Windows	NEW	VS 2019/2022

Performance breakdown (Gemma 4 26B on M1 Pro)

Component	ms/token	Share
Attention matmul (Q8_0 NEON)	168	65%
MoE experts (IQ3_XXS/IQ4_NL NEON)	72	28%
Attention scores	3	1%
Other	14	6%
Total	257	3.9 tok/s

FAQ

How is this different from llama.cpp?

llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (72K LOC) you can read, modify, and embed. Different tools for different problems: llama.cpp optimizes speed, quant.cpp optimizes memory (KV compression) and embeddability (single header).

llama.cpp already has Q4 KV. How is yours better?

Both use 4 bits, but quality differs: llama.cpp Q4_0 KV gives PPL +10.6%, quant.cpp gives +0.0%. The difference: quant.cpp quantizes K and V independently with type-appropriate methods, and offers delta compression (3-bit at +1.3% PPL). llama.cpp has no equivalent.

How does this compare to Karpathy's llm.c?

Similar philosophy: minimal C, educational. Key differences: quant.cpp supports quantized weights (Q4_K_M, Q8_0, IQ2), multiple architectures (Llama, Qwen, Gemma, MoE), GGUF loading, and KV cache compression. Think of llm.c as the textbook and quant.cpp as the production-ready version.

Can I embed this in my app?

Yes. Two options:

Single-header: Copy quant.h, #define QUANT_IMPLEMENTATION in one .c file. Done.
Full library: Link against libturboquant.a.

Works on Linux, macOS, Windows (MSVC/MinGW), iOS, Android, and WASM.

No GPU — is this useless?

If you need 100+ tok/s, use llama.cpp with Metal/CUDA. If you need to embed inference in an iOS app, WASM module, game engine, or IoT device — quant.cpp works. CPU on Apple Silicon: 25 tok/s (1.7B), 11.6 tok/s (3B), 3.9 tok/s (26B MoE).

Can it run in the browser?

Yes. cd wasm && bash build.sh. The WASM binary is 192KB. Drop a GGUF model and chat. Everything runs client-side.

What about sub-3-bit quantization?

Tested extensively (2-bit delta, NF2, online SVD, multi-hash). None reached acceptable quality. Per-step cosine 0.997 compounds to 0.885 after 200 steps. 3-bit + delta is the practical minimum.

Documentation

Document	Description
API Reference	Full C API for quant.h and libturboquant (730 lines)
Custom Quantization	Add your own KV type in 3 functions
ROADMAP	Project direction and planned features
CHANGELOG	Version history and release notes
Tech Report	Architecture and benchmarks (Arxiv draft)
WASM Demo	Try it in your browser — no install needed

References

TurboQuant (ICLR 2026) — KV cache compression theory
QJL (AAAI 2025) — Quantized JL transform
PolarQuant (AISTATS 2026) — Polar coordinate quantization

QuantumAI · GitHub

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LLM inference with 7x longer context — pure C, zero dependencies

The Problem

The Result

How It Compares

vs llama.cpp: Quality at same bit budget

vs every other engine

Supported Models

KV Cache Compression

The Idea

Quality vs Compression

Modes

QK-norm Aware (Gemma 4)

Quick Start

Single-Header Mode

Browser Demo (WASM)

Docker & Server

Backends & Performance

FAQ

Documentation

References

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

LLM inference with 7x longer context — pure C, zero dependencies

The Problem

The Result

How It Compares

vs llama.cpp: Quality at same bit budget

vs every other engine

Supported Models

KV Cache Compression

The Idea

Quality vs Compression

Modes

QK-norm Aware (Gemma 4)

Quick Start

Single-Header Mode

Browser Demo (WASM)

Docker & Server

Backends & Performance

FAQ

Documentation

References