quant.cpp: Practical KV Cache Compression in 67K Lines of C

Abstract

We present quant.cpp, a minimal LLM inference engine that achieves 6.9x KV cache compression with zero perplexity degradation. The engine is implemented in 67K lines of C11 with no external dependencies, and ships as a 15K-line single-header library (quant.h) embeddable in any C project. We implement seven quantization algorithms for KV cache compression, including PolarQuant, QJL, and a novel delta compression scheme that enables 3-bit key quantization at only +1.3% PPL. On a 16GB Mac, quant.cpp extends context length from 50K to 350K tokens for Llama 3.2 3B, and from 4K to 30K tokens for Gemma 4 26B-A4B (128-expert MoE). We describe the architecture, quantization plugin system, and lessons learned from GPU acceleration experiments on Apple Silicon.

1. Introduction

Large language model inference is increasingly memory-bound. At 32K context length, an 8B model's KV cache consumes 4GB — more than the model weights themselves. While weight quantization (Q4, Q8) is well-studied, KV cache compression receives less attention despite dominating memory usage at long contexts.
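
As a rough check on the 4GB figure, the KV cache size follows directly from the model shape. The sketch below assumes a Llama-3-8B-like configuration (32 layers, 8 KV heads, head dimension 128) with FP16 storage; these values and the helper name kv_cache_bytes are illustrative assumptions, not taken from quant.cpp.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to cache keys and values for `tokens` positions.
 * The leading factor 2 accounts for storing both K and V.
 * Hypothetical helper for illustration only. */
static uint64_t kv_cache_bytes(uint64_t layers, uint64_t kv_heads,
                               uint64_t head_dim, uint64_t bytes_per_elem,
                               uint64_t tokens) {
    assert(bytes_per_elem > 0);
    return 2ULL * layers * kv_heads * head_dim * bytes_per_elem * tokens;
}
```

Under the assumed shape, kv_cache_bytes(32, 8, 128, 2, 32768) evaluates to 4294967296 bytes, i.e. 4 GiB, or 128 KiB per cached token.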

Existing KV cache quantization in production engines (llama.cpp Q4_0) introduces +10.6% perplexity degradation on WikiText-2 with SmolLM2 1.7B (Section 5.2), a noticeable quality loss. We show that type-aware independent K/V quantization achieves +0.0% degradation at the same bit budget.

quant.cpp is designed around three principles:

Readable: The full transformer forward pass fits in one file (tq_transformer.c, 2500 lines).
Embeddable: The single-header quant.h (15K lines) compiles with cc app.c -lm -lpthread.
Extensible: Adding a new quantization type requires implementing three functions and registering them in a trait table.

2. Architecture

2.1 Quantization Plugin System

Each KV quantization type is defined by a trait struct:

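#include <stddef.h>  // size_t

typedef struct {
    const char* name;        // human-readable type name
    int block_size;          // elements per block (typically 128)
    size_t type_size;        // bytes per block
    // Compress n floats from src into quantized blocks at dst.
    void (*quantize)(const float* src, void* dst, int n);
    // Expand quantized blocks at src back into n floats at dst.
    void (*dequantize)(const void* src, float* dst, int n);
    // Score query q against seq cached vectors of width dim stored in kv.
    void (*attention)(const float* q, const void* kv, float* scores, int seq, int dim);
} tq_type_traits_t;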
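
To make the three-function contract concrete, here is a minimal sketch of a 4-bit min-max type that fits the trait struct above. The block layout (demo_block_4b), the demo_* function names, the "DEMO_UNIFORM_4B" label, and the table name demo_traits are all hypothetical; the actual quant.cpp layouts and registration mechanism may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trait struct as given in the text. */
typedef struct {
    const char* name;
    int block_size;
    size_t type_size;
    void (*quantize)(const float* src, void* dst, int n);
    void (*dequantize)(const void* src, float* dst, int n);
    void (*attention)(const float* q, const void* kv, float* scores, int seq, int dim);
} tq_type_traits_t;

#define DEMO_BLOCK 128

/* Assumed block layout: FP32 min and step, then 128 packed 4-bit codes. */
typedef struct {
    float min;
    float step;
    uint8_t codes[DEMO_BLOCK / 2];
} demo_block_4b;

/* Map a value to the nearest of 16 levels between lo and lo + 15 * step. */
static uint8_t demo_code(float v, float lo, float step) {
    int c = step > 0.0f ? (int)((v - lo) / step + 0.5f) : 0;
    if (c < 0) c = 0;
    if (c > 15) c = 15;
    return (uint8_t)c;
}

static void demo_quantize(const float* src, void* dst, int n) {
    assert(n % DEMO_BLOCK == 0);
    demo_block_4b* out = (demo_block_4b*)dst;
    for (int b = 0; b < n / DEMO_BLOCK; b++) {
        const float* x = src + b * DEMO_BLOCK;
        float lo = x[0], hi = x[0];
        for (int i = 1; i < DEMO_BLOCK; i++) {
            if (x[i] < lo) lo = x[i];
            if (x[i] > hi) hi = x[i];
        }
        out[b].min = lo;
        out[b].step = (hi - lo) / 15.0f;
        for (int i = 0; i < DEMO_BLOCK; i += 2) {
            uint8_t c0 = demo_code(x[i], lo, out[b].step);
            uint8_t c1 = demo_code(x[i + 1], lo, out[b].step);
            out[b].codes[i / 2] = (uint8_t)(c0 | (c1 << 4));
        }
    }
}

static void demo_dequantize(const void* src, float* dst, int n) {
    assert(n % DEMO_BLOCK == 0);
    const demo_block_4b* in = (const demo_block_4b*)src;
    for (int b = 0; b < n / DEMO_BLOCK; b++) {
        for (int i = 0; i < DEMO_BLOCK; i++) {
            uint8_t byte = in[b].codes[i / 2];
            uint8_t c = (i & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0F);
            dst[b * DEMO_BLOCK + i] = in[b].min + (float)c * in[b].step;
        }
    }
}

/* Dot product of q with each cached vector, dequantizing one block at a time. */
static void demo_attention(const float* q, const void* kv, float* scores, int seq, int dim) {
    assert(dim % DEMO_BLOCK == 0);
    const demo_block_4b* blocks = (const demo_block_4b*)kv;
    int per_row = dim / DEMO_BLOCK;
    float tmp[DEMO_BLOCK];
    for (int t = 0; t < seq; t++) {
        float acc = 0.0f;
        for (int b = 0; b < per_row; b++) {
            demo_dequantize(&blocks[t * per_row + b], tmp, DEMO_BLOCK);
            for (int i = 0; i < DEMO_BLOCK; i++) acc += q[b * DEMO_BLOCK + i] * tmp[i];
        }
        scores[t] = acc;
    }
}

/* Registration: one row per type in a trait table. */
static const tq_type_traits_t demo_traits[] = {
    { "DEMO_UNIFORM_4B", DEMO_BLOCK, sizeof(demo_block_4b),
      demo_quantize, demo_dequantize, demo_attention },
};
```

With this assumed layout each 128-element block occupies 72 bytes, so the per-block header adds 0.5 bits per element on top of the 4-bit codes. Callers reach the type only through the table entry, for example demo_traits[0].quantize(src, dst, n).
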

Seven types are implemented:

Type	Bits	Algorithm	Block Size	PPL vs FP32
TQ_UNIFORM_4B	4	Min-max	128	+0.0%
TQ_UNIFORM_2B	2	Min-max	128	varies
TQ_POLAR_3B	3	PolarQuant	128	+0.8%
TQ_POLAR_4B	4	PolarQuant	128	+0.0%
TQ_QJL_1B	1	QJL sign hash	256	+3.2%
TQ_TURBO_3B	3	Polar 2b + QJL 1b	128	+1.0%
TQ_TURBO_4B	4	Polar 3b + QJL 1b	128	+0.0%

2.2 Delta Compression

Standard KV caching stores each key vector independently. We observe that adjacent key vectors (positions t and t-1) differ by ~30% of their absolute range. Delta mode stores key[t] - reconstruct(key[t-1]), reducing the dynamic range and enabling 3-bit quantization.

Every 64 tokens, an FP32 I-frame is stored (like video compression) to bound drift accumulation. This yields 3-bit compression at +1.3% PPL, compared to +62% without delta encoding.
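
The scheme can be sketched as a closed-loop predictor: each residual is taken against the previously reconstructed key, not the original one, so encoder and decoder stay in sync. The code below is an illustration under stated assumptions, not the quant.cpp implementation: it uses a plain per-vector min-max quantizer for the 3-bit residual, keeps one code per byte instead of bit-packing, and the names delta_roundtrip and MAX_DIM are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IFRAME_INTERVAL 64  /* from the text: one FP32 I-frame every 64 tokens */
#define LEVELS 8            /* 3 bits give 8 quantization levels */
#define MAX_DIM 512         /* assumed upper bound on key width for this sketch */

/* Encodes `tokens` key vectors of width `dim` and writes into `recon` what a
 * decoder would reconstruct. Codes are held one per byte for clarity. */
static void delta_roundtrip(const float* keys, float* recon, int tokens, int dim) {
    assert(dim > 0 && dim <= MAX_DIM);
    uint8_t codes[MAX_DIM];
    for (int t = 0; t < tokens; t++) {
        const float* k = keys + (size_t)t * dim;
        float* r = recon + (size_t)t * dim;
        if (t % IFRAME_INTERVAL == 0) {
            /* I-frame: store the key unmodified in FP32. */
            memcpy(r, k, sizeof(float) * (size_t)dim);
            continue;
        }
        const float* prev = r - dim;  /* reconstructed key[t-1], not the original */
        float lo = k[0] - prev[0], hi = lo;
        for (int i = 1; i < dim; i++) {
            float d = k[i] - prev[i];
            if (d < lo) lo = d;
            if (d > hi) hi = d;
        }
        float step = (hi - lo) / (float)(LEVELS - 1);
        for (int i = 0; i < dim; i++) {
            float d = k[i] - prev[i];
            int c = step > 0.0f ? (int)((d - lo) / step + 0.5f) : 0;
            if (c < 0) c = 0;
            if (c > LEVELS - 1) c = LEVELS - 1;
            codes[i] = (uint8_t)c;
            /* Decoder side: previous reconstruction plus dequantized residual. */
            r[i] = prev[i] + lo + (float)codes[i] * step;
        }
    }
}
```

In this sketch the quantizer only has to cover the range of the residual rather than the full range of the key, which is what makes 8 levels workable; tokens at positions 0, 64, 128, and so on are reproduced exactly.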

2.3 QK-Norm Aware Compression

Models with QK-norm (Gemma 4) normalize key vectors to the unit sphere, creating extremely sparse distributions (256 dimensions, ~56 active). We find that 4-bit quantization achieves only 0.62 cosine similarity on QK-normed keys — destroying directional information.

Our solution: auto-detect QK-normed models and store keys in FP32 while quantizing only values to Q4. This preserves perfect key precision with 3.5x value memory reduction.

3. Supported Architectures

quant.cpp supports the following model architectures:

Llama 3 (GQA, standard RoPE)
Qwen 3.5 (DeltaNet hybrid attention)
Gemma 3/4 (sliding + full attention, 4 norms/layer)
Gemma 4 MoE (128 experts, dual-FFN, learned RoPE, GeGLU)
Qwen MoE (256 experts, shared expert)

The Gemma 4 26B-A4B-it implementation required solving 10 architecture-specific issues, including dual-FFN parallel execution, layer_output_scale semantics, and attention_scale=1.0 for QK-normed models.

4. GPU Acceleration Experiments

We conducted extensive Metal GPU experiments on Apple M1 Pro:

Approach	SmolLM2 135M	vs CPU
CPU NEON Q4×Q8 fused dot	96 tok/s	1.0x
Per-matmul Metal dispatch	38 tok/s	0.4x
2-commit GPU graph	18 tok/s	0.2x
1-commit GPU graph	22 tok/s	0.2x
+ Weight repacking	27 tok/s	0.3x
+ uint16 mask kernel	27 tok/s	0.3x

Finding: For batch-1 token generation on Apple Silicon unified memory, CPU NEON saturates memory bandwidth more efficiently than GPU due to command buffer dispatch overhead (~0.3ms per commit). GPU acceleration requires a tensor graph IR (like ggml) that compiles the entire forward pass into a single GPU dispatch — effectively building a GPU inference framework from scratch.

5. Performance

5.1 Speed

Model	Params	tok/s (M1 Pro)
SmolLM2 135M	135M	96
Llama 3.2 3B	3B	17
Gemma 4 26B-A4B	26B (4B active)	3.9

5.2 KV Compression Quality

WikiText-2 PPL on SmolLM2 1.7B:

Config	PPL	vs FP32	Compression
FP32 baseline	14.63	—	1.0x
4b K + FP16 V	14.63	+0.00%	1.6x
4b K + Q4 V	14.57	-0.4%	6.9x
Delta 3b K + Q4 V	14.82	+1.3%	8.5x
llama.cpp Q4_0 KV	16.18	+10.6%	3.8x
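
The "vs FP32" column is the relative change in perplexity against the FP32 baseline. A small helper, with the hypothetical name ppl_delta_percent, shows the arithmetic:

```c
#include <assert.h>

/* Relative perplexity change in percent versus a baseline.
 * Positive values mean worse (higher) perplexity than the baseline. */
static double ppl_delta_percent(double ppl, double baseline) {
    assert(baseline > 0.0);
    return (ppl - baseline) / baseline * 100.0;
}
```

For the llama.cpp Q4_0 row, ppl_delta_percent(16.18, 14.63) gives roughly 10.6, and for the delta 3-bit row, ppl_delta_percent(14.82, 14.63) gives roughly 1.3, matching the table.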

5.3 Context Extension

On 16GB Mac M1 Pro:

Model	FP16 KV	quant.cpp KV	Gain
Llama 3.2 3B	50K tokens	350K tokens	6.9x
Gemma 4 26B MoE	4K tokens	30K tokens	6.9x

6. Related Work

TurboQuant (Zandieh et al., ICLR 2026): KV cache compression theory
QJL (AAAI 2025): Quantized Johnson-Lindenstrauss transform
PolarQuant (AISTATS 2026): Polar coordinate quantization
llama.cpp: Production inference engine with Q4 KV quantization
llm.c (Karpathy): Minimal C training/inference, educational focus

7. Conclusion

quant.cpp demonstrates that practical KV cache compression is achievable in a minimal, embeddable codebase. The key insight is that independent K/V quantization with type-aware methods eliminates the quality degradation seen in uniform approaches. The project serves as both a production-ready library for embedding LLM inference in applications and a research platform for experimenting with new quantization algorithms.

Code: https://github.com/quantumaikr/quant.cpp
