We present quant.cpp, a minimal LLM inference engine that achieves 6.9x KV cache compression with zero perplexity degradation. The engine is implemented in 67K lines of C11 with no external dependencies, and ships as a 15K-line single-header library (quant.h) embeddable in any C project. We implement seven quantization algorithms for KV cache compression, including PolarQuant, QJL, and a novel delta compression scheme that enables 3-bit key quantization at only +1.3% PPL. On a 16GB Mac, quant.cpp extends context length from 50K to 350K tokens for Llama 3.2 3B, and from 4K to 30K tokens for Gemma 4 26B-A4B (128-expert MoE). We describe the architecture, quantization plugin system, and lessons learned from GPU acceleration experiments on Apple Silicon.
Large language model inference is increasingly memory-bound. At 32K context length, an 8B model's KV cache consumes 4GB — more than the model weights themselves. While weight quantization (Q4, Q8) is well-studied, KV cache compression receives less attention despite dominating memory usage at long contexts.
Existing KV cache quantization in production engines (llama.cpp Q4_0) introduces +10.6% perplexity degradation — noticeable quality loss. We show that type-aware independent K/V quantization achieves +0.0% degradation at the same bit budget.
quant.cpp is designed around three principles:
- Readable: The full transformer forward pass fits in one file (tq_transformer.c, 2500 lines).
- Embeddable: The single-header quant.h (15K lines) compiles with
cc app.c -lm -lpthread. - Extensible: Adding a new quantization type requires implementing three functions and registering them in a trait table.
Each KV quantization type is defined by a trait struct:
typedef struct {
const char* name;
int block_size; // elements per block (typically 128)
size_t type_size; // bytes per block
void (*quantize)(const float* src, void* dst, int n);
void (*dequantize)(const void* src, float* dst, int n);
void (*attention)(const float* q, const void* kv, float* scores, int seq, int dim);
} tq_type_traits_t;Seven types are implemented:
| Type | Bits | Algorithm | Block Size | PPL vs FP32 |
|---|---|---|---|---|
| TQ_UNIFORM_4B | 4 | Min-max | 128 | +0.0% |
| TQ_UNIFORM_2B | 2 | Min-max | 128 | varies |
| TQ_POLAR_3B | 3 | PolarQuant | 128 | +0.8% |
| TQ_POLAR_4B | 4 | PolarQuant | 128 | +0.0% |
| TQ_QJL_1B | 1 | QJL sign hash | 256 | +3.2% |
| TQ_TURBO_3B | 3 | Polar 2b + QJL 1b | 128 | +1.0% |
| TQ_TURBO_4B | 4 | Polar 3b + QJL 1b | 128 | +0.0% |
Standard KV caching stores each key vector independently. We observe that adjacent key vectors (positions t and t-1) differ by ~30% of their absolute range. Delta mode stores key[t] - reconstruct(key[t-1]), reducing the dynamic range and enabling 3-bit quantization.
Every 64 tokens, an FP32 I-frame is stored (like video compression) to bound drift accumulation. This yields 3-bit compression at +1.3% PPL, compared to +62% without delta encoding.
Models with QK-norm (Gemma 4) normalize key vectors to the unit sphere, creating extremely sparse distributions (256 dimensions, ~56 active). We find that 4-bit quantization achieves only 0.62 cosine similarity on QK-normed keys — destroying directional information.
Our solution: auto-detect QK-normed models and store keys in FP32 while quantizing only values to Q4. This preserves perfect key precision with 3.5x value memory reduction.
quant.cpp supports seven model architectures:
- Llama 3 (GQA, standard RoPE)
- Qwen 3.5 (DeltaNet hybrid attention)
- Gemma 3/4 (sliding + full attention, 4 norms/layer)
- Gemma 4 MoE (128 experts, dual-FFN, learned RoPE, GeGLU)
- Qwen MoE (256 experts, shared expert)
The Gemma 4 26B-A4B-it implementation required solving 10 architecture-specific issues including dual-FFN parallel execution, layer_output_scale semantics, and attention_scale=1.0 for QK-normed models.
We conducted extensive Metal GPU experiments on Apple M1 Pro:
| Approach | SmolLM2 135M | vs CPU |
|---|---|---|
| CPU NEON Q4×Q8 fused dot | 96 tok/s | 1.0x |
| Per-matmul Metal dispatch | 38 tok/s | 0.4x |
| 2-commit GPU graph | 18 tok/s | 0.2x |
| 1-commit GPU graph | 22 tok/s | 0.2x |
| + Weight repacking | 27 tok/s | 0.3x |
| + uint16 mask kernel | 27 tok/s | 0.3x |
Finding: For batch-1 token generation on Apple Silicon unified memory, CPU NEON saturates memory bandwidth more efficiently than GPU due to command buffer dispatch overhead (~0.3ms per commit). GPU acceleration requires a tensor graph IR (like ggml) that compiles the entire forward pass into a single GPU dispatch — effectively building a GPU inference framework from scratch.
| Model | Params | tok/s (M1 Pro) |
|---|---|---|
| SmolLM2 135M | 135M | 96 |
| Llama 3.2 3B | 3B | 17 |
| Gemma 4 26B-A4B | 26B (4B active) | 3.9 |
WikiText-2 PPL on SmolLM2 1.7B:
| Config | PPL | vs FP32 | Compression |
|---|---|---|---|
| FP32 baseline | 14.63 | — | 1.0x |
| 4b K + FP16 V | 14.63 | +0.00% | 1.6x |
| 4b K + Q4 V | 14.57 | -0.4% | 6.9x |
| Delta 3b K + Q4 V | 14.82 | +1.3% | 8.5x |
| llama.cpp Q4_0 KV | 16.18 | +10.6% | 3.8x |
On 16GB Mac M1 Pro:
| Model | FP16 KV | quant.cpp KV | Gain |
|---|---|---|---|
| Llama 3.2 3B | 50K tokens | 350K tokens | 6.9x |
| Gemma 4 26B MoE | 4K tokens | 30K tokens | 6.9x |
- TurboQuant (Zandieh et al., ICLR 2026): KV cache compression theory
- QJL (AAAI 2025): Quantized Johnson-Lindenstrauss transform
- PolarQuant (AISTATS 2026): Polar coordinate quantization
- llama.cpp: Production inference engine with Q4 KV quantization
- llm.c (Karpathy): Minimal C training/inference, educational focus
quant.cpp demonstrates that practical KV cache compression is achievable in a minimal, embeddable codebase. The key insight is that independent K/V quantization with type-aware methods eliminates the quality degradation seen in uniform approaches. The project serves as both a production-ready library for embedding LLM inference in applications and a research platform for experimenting with new quantization algorithms.