[Draft] Support TurboQuant KV-cache quantization#1634

Draft
lvliang-intel wants to merge 3 commits into main from lvl/support_turbo_quant

Conversation

@lvliang-intel
Contributor

@lvliang-intel lvliang-intel commented Mar 27, 2026

Description

We have implemented a working TurboQuant KV-cache prototype in AutoRound with both algorithm-side and runtime-side support.
We used two references for this work: the [TurboQuant paper](https://arxiv.org/abs/2504.19874) (ICLR 2026, Google Research) and the vLLM TurboQuant PR vllm-project/vllm#38280; each influenced different parts of the implementation.

What is implemented:
1. Core TurboQuant quantization pipeline
2. QJL residual correction
3. GPU acceleration
4. Runtime cache modes
   a. Pre-dequant mode: quantize, then immediately dequantize and store bf16 KV tensors in a standard cache
   b. Packed mode: store a bit-packed KV cache with explicit unpack + reconstruction on read

Both modes are wired through the TurboQuant KV-cache runtime builder.
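For context, the two runtime cache modes can be sketched roughly as follows. This is a minimal numpy sketch with hypothetical helper names (not the actual AutoRound/TurboQuant API), using plain per-token symmetric 4-bit quantization; TurboQuant's random rotation and the QJL residual correction are omitted for brevity:

```python
import numpy as np

BITS = 4
QMAX = 2 ** (BITS - 1) - 1  # 7 for signed 4-bit codes in [-8, 7]

def quantize(kv: np.ndarray):
    """Per-token (last-dim) symmetric quantization to signed 4-bit codes."""
    scale = np.abs(kv).max(axis=-1, keepdims=True) / QMAX
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on all-zero tokens
    codes = np.clip(np.round(kv / scale), -QMAX - 1, QMAX).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scale

# Mode a (pre-dequant): quantize, immediately dequantize, store float tensors
# in a standard cache -- no custom read path needed.
def pre_dequant_store(kv):
    codes, scale = quantize(kv)
    return dequantize(codes, scale)

# Mode b (packed): store bit-packed codes (two 4-bit values per byte) plus
# scales; unpack and reconstruct explicitly on read.
def pack_store(kv):
    codes, scale = quantize(kv)
    u = (codes.astype(np.int16) + 8).astype(np.uint8)    # shift to unsigned 0..15
    packed = (u[..., 0::2] << 4) | u[..., 1::2]          # two codes per byte
    return packed, scale

def pack_load(packed, scale):
    hi = (packed >> 4).astype(np.int16) - 8              # even-index codes
    lo = (packed & 0x0F).astype(np.int16) - 8            # odd-index codes
    codes = np.stack([hi, lo], axis=-1).reshape(*packed.shape[:-1], -1)
    return dequantize(codes.astype(np.int8), scale)

kv = np.random.randn(2, 16, 64).astype(np.float32)       # (heads, tokens, head_dim)
approx_a = pre_dequant_store(kv)
packed, scale = pack_store(kv)
approx_b = pack_load(packed, scale)
assert np.allclose(approx_a, approx_b)                   # both modes reconstruct the same values
assert packed.nbytes == kv.size // 2                     # 4-bit payload: half a byte per element
```

The trade-off mirrors the description above: pre-dequant keeps the hot attention path unchanged but gives up the memory savings, while packed mode keeps the cache compressed at the cost of an unpack on every read.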

Current test results for TurboQuant 4-bit are good, but TurboQuant 3-bit and 2-bit still have problems.
TurboQuant decode still happens outside attention, so HBM traffic remains too high.

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@lvliang-intel lvliang-intel changed the title from [Draft] Support to [Draft] Support TurboQuant KV-cache quantization Mar 27, 2026
@wenhuach21
Contributor

You’re moving really fast!

That said, I don’t think this feature brings clear benefits to AR at the moment. The main advantage seems to be eval accuracy with quantized kv cache. However, for Transformers, KV cache quantization at the prefill stage is difficult, which could introduce bias.

@lvliang-intel
Contributor Author

I don’t think this feature brings clear benefits to AR at the moment. The main advantage seems to be eval accuracy with quantized kv cache. However, for Transformers, KV cache quantization at the prefill stage is difficult, which could introduce bias.

You are right. I think the core value of TurboQuant lies in memory savings in serving scenarios, which translate to higher throughput.

@lvliang-intel
Contributor Author

Eval lambada_openai result
CUDA_VISIBLE_DEVICES=0 python eval_turboquant.py --model_path /mnt/disk4/lvl/Qwen3-30B-A3B-Instruct-2507/ --tasks lambada_openai --bits 2,3,4 --mode packed --residual_length 16 --batch_size 8
eval_turboquant.py

1. KV Cache Compression

| Config | KV Mem (KB) | Raw (KB) | Compression |
| --- | --- | --- | --- |
| Baseline | 24576.0 | 24576.0 | 1.00x |
| DynCache (control) | 24576.0 | 24576.0 | 1.00x |
| TQ 2b packed rl=16 | 7860.0 | 24576.0 | 6.25x |
| TQ 3b packed rl=16 | 10836.0 | 24576.0 | 4.54x |
| TQ 4b packed rl=16 | 13812.0 | 24576.0 | 3.56x |

2. Task Accuracy

| Config | lambada_openai (acc) |
| --- | --- |
| Baseline | 0.7151 |
| DynCache (control) | 0.7151 (+0.0000) |
| TQ 2b packed rl=16 | 0.4003 (-0.3148) |
| TQ 3b packed rl=16 | 0.6559 (-0.0592) |
| TQ 4b packed rl=16 | 0.7091 (-0.0060) |
