Introduction
TurboQuant (ICLR 2026) brings data-oblivious compression to KV caches and embedding vectors, achieving ~3.5 bits per value with provably near-optimal geometry preservation. Unlike codebook-based approaches (GPTQ, AWQ), TurboQuant requires no training, no dataset-specific tuning, and works online — compressing vectors as they arrive.
This issue tracks the full implementation of TurboQuant within the ruvLLM engine, from the core compressor through KV cache integration, embedding store, and production-grade optimizations.
Why TurboQuant?
| Property | TurboQuant | GPTQ/AWQ | PiQ3 (existing) |
|---|---|---|---|
| Training required | No | Yes (calibration set) | No |
| Online compression | Yes | No (batch) | Yes |
| Geometry preservation | Provable (2.7× optimal) | Empirical | Empirical |
| KV cache compatible | Native | Retrofit | Not designed for |
| Memory reduction | ~6× vs FP16 | ~4× vs FP16 | ~5× vs FP16 |
| Attention speedup | Up to 8× | Limited | N/A |
Algorithm Overview
TurboQuant is a two-stage pipeline:
- PolarQuant: Random Hadamard rotation → scalar quantization per coordinate
  - Rotation makes dimensions approximately independent (Beta-distributed)
  - Enables optimal per-coordinate scalar quantization without codebooks
- QJL Residual: 1-bit Quantized Johnson-Lindenstrauss on the residual
  - Corrects quantization error with just 1 extra bit per dimension
  - Produces an unbiased inner product estimator
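The two stages above can be sketched in Rust. This is an illustrative toy, not the ruvLLM implementation: `fwht`, `Compressed`, `compress`, and `decompress` are hypothetical names, a deterministic normalized Walsh-Hadamard transform stands in for the random Hadamard rotation, and a uniform 3-bit grid stands in for the paper's optimal scalar quantizer.

```rust
/// In-place fast Walsh-Hadamard transform, normalized by 1/sqrt(n) so it
/// is orthonormal and self-inverse. `x.len()` must be a power of two.
fn fwht(x: &mut [f32]) {
    let n = x.len();
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let s = 1.0 / (n as f32).sqrt();
    x.iter_mut().for_each(|v| *v *= s);
}

/// Toy compressed form: 3-bit scalar codes plus a 1-bit residual sign
/// per coordinate (the QJL-style correction).
struct Compressed {
    codes: Vec<u8>,
    signs: Vec<bool>,
    scale: f32,         // quantization step of the 3-bit grid
    residual_norm: f32, // norm of the leftover error, used when decoding
}

fn compress(v: &[f32]) -> Compressed {
    // Stage 1 (PolarQuant): rotate, then quantize each coordinate on a
    // uniform 8-level grid over [-max, max].
    let mut r = v.to_vec();
    fwht(&mut r);
    let max = r.iter().fold(1e-12f32, |m, &x| m.max(x.abs()));
    let scale = 2.0 * max / 7.0;
    let codes: Vec<u8> = r
        .iter()
        .map(|&x| ((x + max) / scale).round().clamp(0.0, 7.0) as u8)
        .collect();
    // Stage 2 (QJL residual): keep only the sign of the per-coordinate
    // quantization error, plus its overall norm.
    let residual: Vec<f32> = r
        .iter()
        .zip(&codes)
        .map(|(&x, &c)| x - (c as f32 * scale - max))
        .collect();
    let residual_norm = residual.iter().map(|x| x * x).sum::<f32>().sqrt();
    let signs = residual.iter().map(|&x| x >= 0.0).collect();
    Compressed { codes, signs, scale, residual_norm }
}

/// Approximate the rotated vector: decode the grid value, then add back a
/// sign-directed, evenly spread share of the residual norm.
fn decompress(c: &Compressed) -> Vec<f32> {
    let max = 7.0 * c.scale / 2.0;
    let share = c.residual_norm / (c.codes.len() as f32).sqrt();
    c.codes
        .iter()
        .zip(&c.signs)
        .map(|(&code, &s)| code as f32 * c.scale - max + if s { share } else { -share })
        .collect()
}
```

A real implementation would bit-pack the codes (3 bits each plus 1 sign bit ≈ the paper's ~3.5 bits/value after mixing precisions) and apply a random sign flip before the Hadamard transform to get the data-oblivious guarantee.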
Implementation Status
Core Compressor (turbo_quant.rs) — ✅ Complete
- `TurboQuantCompressor` with Hadamard rotation + scalar quantization
KV Cache Integration (kv_cache.rs) — ✅ Complete
- `TurboQuantCacheTier` — compressed storage for cold tokens
- `TurboQuantKvCache` — three-tier cache (FP16 hot + TurboQuant cold) with a configurable `tail_length`
- `CacheTier::TurboQuant` enum variant
- `CacheQuantization::TurboQuantHybrid` configuration
Embedding Store (turbo_quant.rs) — ✅ Complete
- `TurboQuantEmbeddingStore` for RuVector-compatible compressed search
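The hot/cold split in the KV cache integration can be illustrated with a minimal sketch. All names here (`HybridCache`, `push`, the `compress_fn` parameter) are hypothetical, not the actual ruvLLM API: the most recent `tail_length` tokens stay uncompressed, and older entries are evicted into a compressed cold tier.

```rust
/// Hypothetical two-tier cache: recent tokens stay in full precision,
/// older ones are compressed on eviction. Stand-in for the real
/// three-tier TurboQuantKvCache design described above.
struct HybridCache {
    tail_length: usize,
    hot: std::collections::VecDeque<Vec<f32>>, // recent tokens, full precision
    cold: Vec<Vec<u8>>,                        // compressed older tokens
}

impl HybridCache {
    fn new(tail_length: usize) -> Self {
        Self { tail_length, hot: Default::default(), cold: Vec::new() }
    }

    /// Append a new KV entry; once the hot window overflows, the oldest
    /// entry is pushed through the compressor into the cold tier.
    fn push(&mut self, kv: Vec<f32>, compress_fn: impl Fn(&[f32]) -> Vec<u8>) {
        self.hot.push_back(kv);
        while self.hot.len() > self.tail_length {
            let old = self.hot.pop_front().unwrap();
            self.cold.push(compress_fn(&old));
        }
    }

    fn len(&self) -> usize {
        self.hot.len() + self.cold.len()
    }
}
```

The design point this illustrates: compression happens online, one evicted token at a time, so no batch calibration pass is ever needed.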
Optimization Roadmap
Phase 1: Inner Product Optimization — 🔄 In Progress
- Compute `<Hq, Hk>` directly in the rotated domain (orthogonal invariance)
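The rotated-domain shortcut rests on orthogonal invariance: a normalized Hadamard transform H satisfies HᵀH = I, so ⟨Hq, Hk⟩ = ⟨q, k⟩ and attention scores can be computed on rotated (compressed-domain) vectors without undoing the rotation. A minimal check, using a normalized fast Walsh-Hadamard transform as a stand-in for the ruvLLM kernel (`fwht` and `dot` are illustrative helpers, not project API):

```rust
/// In-place fast Walsh-Hadamard transform, normalized by 1/sqrt(n) so it
/// is orthonormal. `x.len()` must be a power of two.
fn fwht(x: &mut [f32]) {
    let n = x.len();
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let s = 1.0 / (n as f32).sqrt();
    x.iter_mut().for_each(|v| *v *= s);
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Because the rotation preserves inner products exactly (up to float rounding), the only error in a compressed-domain attention score comes from quantization, not from skipping the inverse transform.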
Phase 2: Benchmarks & Profiling
Phase 3: Production Hardening
- `TurboQuantKvCache` stress tests
Phase 4: Advanced Features
Key Metrics
| Metric |
Target |
Current |
| Compression ratio vs FP16 |
>4× |
~4.6× (3.5-bit) |
| Inner product relative error |
<15% |
<15% (tested) |
| Compression throughput |
>1M vec/s |
TBD (benchmarks needed) |
| Attention latency reduction |
>4× |
TBD (integration needed) |
| Max sequence length at 8GB |
>128K |
TBD (simulation needed) |
Files
| File | Description |
|---|---|
| crates/ruvllm/src/quantize/turbo_quant.rs | Core compressor, cache tier, embedding store |
| crates/ruvllm/src/quantize/mod.rs | Module exports |
| crates/ruvllm/src/kv_cache.rs | Three-tier KV cache with TurboQuant integration |
| docs/research/quantization-edge/08-turboquant-kv-cache-compression.md | Research document |
References
Related
- ADR-090: Quantization pipeline architecture
- Existing PiQ3 quantization in crates/ruvllm/src/quantize/pi_quant.rs
- Hadamard transform in crates/ruvllm/src/quantize/hadamard.rs