[Draft] Support TurboQuant KV-cache quantization #1634
lvliang-intel wants to merge 3 commits into main
Conversation
You’re moving really fast! That said, I don’t think this feature brings clear benefits to AR at the moment. The main advantage seems to be eval accuracy with a quantized KV cache. However, for Transformers, KV-cache quantization at the prefill stage is difficult, which could introduce bias.
You are right. I think the core value of TurboQuant lies in memory savings in serving scenarios, which translates to higher throughput.
Eval lambada_openai results:
1. KV Cache Compression
2. Task Accuracy
Description
We have implemented a working TurboQuant KV-cache prototype in AutoRound with both algorithm-side and runtime-side support.
We used two references for this work: the [TurboQuant paper](https://arxiv.org/abs/2504.19874) (ICLR 2026, Google Research) and the vLLM TurboQuant PR vllm-project/vllm#38280; each influenced different parts of the implementation.
What is implemented:
1. Core TurboQuant quantization pipeline
2. QJL residual correction
3. GPU acceleration
4. Runtime cache modes
   a. Pre-dequant mode: quantize, then immediately dequantize and store bf16 KV tensors in a standard cache
   b. Packed mode: store a bit-packed KV cache with explicit unpack + reconstruction on read
Both modes are wired through the TurboQuant KV-cache runtime builder.
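To make the two runtime cache modes concrete, here is a minimal, self-contained sketch of how they differ. This is illustrative only: the function names and the symmetric per-row 4-bit scheme are assumptions for exposition, not AutoRound's actual API or the exact TurboQuant algorithm (which also involves rotations and QJL residual correction).

```python
# Hypothetical sketch of the two runtime cache modes; names are illustrative,
# not AutoRound's actual API.
import numpy as np

def quantize_4bit(x: np.ndarray):
    """Symmetric per-row 4-bit quantization: int codes in [-8, 7] plus per-row scales."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

def predequant_store(kv: np.ndarray) -> np.ndarray:
    # Pre-dequant mode: quantize then immediately dequantize; the cache holds
    # full-precision tensors, so accuracy (not memory) reflects quantization.
    q, s = quantize_4bit(kv)
    return dequantize(q, s)

def pack_store(kv: np.ndarray):
    # Packed mode: store two 4-bit codes per byte plus the scales.
    q, s = quantize_4bit(kv)
    u = (q + 8).astype(np.uint8)                 # shift codes to [0, 15]
    packed = (u[..., 0::2] << 4) | u[..., 1::2]  # two nibbles per byte
    return packed, s

def pack_load(packed: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Explicit unpack + reconstruction on read.
    hi = (packed >> 4).astype(np.int8) - 8
    lo = (packed & 0x0F).astype(np.int8) - 8
    q = np.stack([hi, lo], axis=-1).reshape(*packed.shape[:-1], -1)
    return dequantize(q, scale)
```

Both paths reconstruct identical values; the packed path simply defers dequantization to read time and stores half a byte per element plus scales.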
Currently the test results for TurboQuant 4-bit are good, but TurboQuant 3-bit and 2-bit still have problems.
TurboQuant decode (dequantization) still happens outside the attention kernel, so HBM traffic is still too high.
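For context on the memory side, a back-of-envelope comparison of a bf16 KV cache against 4-bit packed storage shows where the savings come from. The numbers below are illustrative assumptions (head_dim of 128, one fp16 scale per row), not measurements from this PR:

```python
# Back-of-envelope KV-cache footprint: bf16 baseline vs 4-bit packed with one
# fp16 scale per (token, head) row. Illustrative numbers, not measured here.
head_dim = 128
bf16_bits = 16 * head_dim        # bits per token-head row in bf16
packed_bits = 4 * head_dim + 16  # 4-bit codes + one fp16 scale
ratio = bf16_bits / packed_bits
print(f"compression ratio ~ {ratio:.2f}x")  # prints "compression ratio ~ 3.88x"
```

This roughly 4x smaller cache is what translates into higher serving throughput, but only if the dequantization is fused into the attention kernel so the packed representation is what actually moves through HBM.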
Type of Change
Related Issues
Fixes or relates to #
Checklist Before Submitting