Skip to content

Latest commit

 

History

History
128 lines (89 loc) · 6.57 KB

File metadata and controls

128 lines (89 loc) · 6.57 KB

quant.cpp: Practical KV Cache Compression in 67K Lines of C

Abstract

We present quant.cpp, a minimal LLM inference engine that achieves 6.9x KV cache compression with zero perplexity degradation. The engine is implemented in 67K lines of C11 with no external dependencies, and ships as a 15K-line single-header library (quant.h) embeddable in any C project. We implement seven quantization algorithms for KV cache compression, including PolarQuant, QJL, and a novel delta compression scheme that enables 3-bit key quantization at only +1.3% PPL. On a 16GB Mac, quant.cpp extends context length from 50K to 350K tokens for Llama 3.2 3B, and from 4K to 30K tokens for Gemma 4 26B-A4B (128-expert MoE). We describe the architecture, quantization plugin system, and lessons learned from GPU acceleration experiments on Apple Silicon.

1. Introduction

Large language model inference is increasingly memory-bound. At 32K context length, an 8B model's KV cache consumes 4GB — more than the model weights themselves. While weight quantization (Q4, Q8) is well-studied, KV cache compression receives less attention despite dominating memory usage at long contexts.

Existing KV cache quantization in production engines (llama.cpp Q4_0) introduces +10.6% perplexity degradation — noticeable quality loss. We show that type-aware independent K/V quantization achieves +0.0% degradation at the same bit budget.

quant.cpp is designed around three principles:

  1. Readable: The full transformer forward pass fits in one file (tq_transformer.c, 2500 lines).
  2. Embeddable: The single-header quant.h (15K lines) compiles with cc app.c -lm -lpthread.
  3. Extensible: Adding a new quantization type requires implementing three functions and registering them in a trait table.

2. Architecture

2.1 Quantization Plugin System

Each KV quantization type is defined by a trait struct:

typedef struct {
    const char* name;
    int block_size;          // elements per block (typically 128)
    size_t type_size;        // bytes per block
    void (*quantize)(const float* src, void* dst, int n);
    void (*dequantize)(const void* src, float* dst, int n);
    void (*attention)(const float* q, const void* kv, float* scores, int seq, int dim);
} tq_type_traits_t;

Seven types are implemented:

Type Bits Algorithm Block Size PPL vs FP32
TQ_UNIFORM_4B 4 Min-max 128 +0.0%
TQ_UNIFORM_2B 2 Min-max 128 varies
TQ_POLAR_3B 3 PolarQuant 128 +0.8%
TQ_POLAR_4B 4 PolarQuant 128 +0.0%
TQ_QJL_1B 1 QJL sign hash 256 +3.2%
TQ_TURBO_3B 3 Polar 2b + QJL 1b 128 +1.0%
TQ_TURBO_4B 4 Polar 3b + QJL 1b 128 +0.0%

2.2 Delta Compression

Standard KV caching stores each key vector independently. We observe that adjacent key vectors (positions t and t-1) differ by ~30% of their absolute range. Delta mode stores key[t] - reconstruct(key[t-1]), reducing the dynamic range and enabling 3-bit quantization.

Every 64 tokens, an FP32 I-frame is stored (like video compression) to bound drift accumulation. This yields 3-bit compression at +1.3% PPL, compared to +62% without delta encoding.

2.3 QK-Norm Aware Compression

Models with QK-norm (Gemma 4) normalize key vectors to the unit sphere, creating extremely sparse distributions (256 dimensions, ~56 active). We find that 4-bit quantization achieves only 0.62 cosine similarity on QK-normed keys — destroying directional information.

Our solution: auto-detect QK-normed models and store keys in FP32 while quantizing only values to Q4. This preserves perfect key precision with 3.5x value memory reduction.

3. Supported Architectures

quant.cpp supports seven model architectures:

  • Llama 3 (GQA, standard RoPE)
  • Qwen 3.5 (DeltaNet hybrid attention)
  • Gemma 3/4 (sliding + full attention, 4 norms/layer)
  • Gemma 4 MoE (128 experts, dual-FFN, learned RoPE, GeGLU)
  • Qwen MoE (256 experts, shared expert)

The Gemma 4 26B-A4B-it implementation required solving 10 architecture-specific issues including dual-FFN parallel execution, layer_output_scale semantics, and attention_scale=1.0 for QK-normed models.

4. GPU Acceleration Experiments

We conducted extensive Metal GPU experiments on Apple M1 Pro:

Approach SmolLM2 135M vs CPU
CPU NEON Q4×Q8 fused dot 96 tok/s 1.0x
Per-matmul Metal dispatch 38 tok/s 0.4x
2-commit GPU graph 18 tok/s 0.2x
1-commit GPU graph 22 tok/s 0.2x
+ Weight repacking 27 tok/s 0.3x
+ uint16 mask kernel 27 tok/s 0.3x

Finding: For batch-1 token generation on Apple Silicon unified memory, CPU NEON saturates memory bandwidth more efficiently than GPU due to command buffer dispatch overhead (~0.3ms per commit). GPU acceleration requires a tensor graph IR (like ggml) that compiles the entire forward pass into a single GPU dispatch — effectively building a GPU inference framework from scratch.

5. Performance

5.1 Speed

Model Params tok/s (M1 Pro)
SmolLM2 135M 135M 96
Llama 3.2 3B 3B 17
Gemma 4 26B-A4B 26B (4B active) 3.9

5.2 KV Compression Quality

WikiText-2 PPL on SmolLM2 1.7B:

Config PPL vs FP32 Compression
FP32 baseline 14.63 1.0x
4b K + FP16 V 14.63 +0.00% 1.6x
4b K + Q4 V 14.57 -0.4% 6.9x
Delta 3b K + Q4 V 14.82 +1.3% 8.5x
llama.cpp Q4_0 KV 16.18 +10.6% 3.8x

5.3 Context Extension

On 16GB Mac M1 Pro:

Model FP16 KV quant.cpp KV Gain
Llama 3.2 3B 50K tokens 350K tokens 6.9x
Gemma 4 26B MoE 4K tokens 30K tokens 6.9x

6. Related Work

  • TurboQuant (Zandieh et al., ICLR 2026): KV cache compression theory
  • QJL (AAAI 2025): Quantized Johnson-Lindenstrauss transform
  • PolarQuant (AISTATS 2026): Polar coordinate quantization
  • llama.cpp: Production inference engine with Q4 KV quantization
  • llm.c (Karpathy): Minimal C training/inference, educational focus

7. Conclusion

quant.cpp demonstrates that practical KV cache compression is achievable in a minimal, embeddable codebase. The key insight is that independent K/V quantization with type-aware methods eliminates the quality degradation seen in uniform approaches. The project serves as both a production-ready library for embedding LLM inference in applications and a research platform for experimenting with new quantization algorithms.

Code: https://github.com/quantumaikr/quant.cpp