[Draft] Support TurboQuant KV-cache quantization#1634

Draft
lvliang-intel wants to merge 3 commits into main from lvl/support_turbo_quant

Conversation

@lvliang-intel
Contributor

@lvliang-intel lvliang-intel commented Mar 27, 2026

Description

We have implemented a working TurboQuant KV-cache prototype in AutoRound with both algorithm-side and runtime-side support.
We used two references for this work: the [TurboQuant paper](https://arxiv.org/abs/2504.19874) (ICLR 2026, Google Research) and the vLLM TurboQuant PR vllm-project/vllm#38280; each influenced different parts of the implementation.

What is implemented:
1. Core TurboQuant quantization pipeline
2. QJL residual correction
3. GPU acceleration
4. Runtime cache modes
   a. Pre-dequant mode: quantize, then immediately dequantize and store bf16 KV tensors in a standard cache
   b. Packed mode: store a bit-packed KV cache with explicit unpack + reconstruction on read

Both modes are wired through the TurboQuant KV-cache runtime builder.
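For context, the two runtime cache modes can be sketched roughly as follows. This is a minimal numpy sketch with hypothetical helper names (not the actual AutoRound/TurboQuant API), using plain per-token symmetric 4-bit quantization; TurboQuant's random rotation and the QJL residual correction are omitted for brevity:

```python
import numpy as np

BITS = 4
QMAX = 2 ** (BITS - 1) - 1  # 7 for signed 4-bit codes in [-8, 7]

def quantize(kv: np.ndarray):
    """Per-token (last-dim) symmetric quantization to signed 4-bit codes."""
    scale = np.abs(kv).max(axis=-1, keepdims=True) / QMAX
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on all-zero tokens
    codes = np.clip(np.round(kv / scale), -QMAX - 1, QMAX).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scale

# Mode a (pre-dequant): quantize, immediately dequantize, store float tensors
# in a standard cache -- no custom read path needed.
def pre_dequant_store(kv):
    codes, scale = quantize(kv)
    return dequantize(codes, scale)

# Mode b (packed): store bit-packed codes (two 4-bit values per byte) plus
# scales; unpack and reconstruct explicitly on read.
def pack_store(kv):
    codes, scale = quantize(kv)
    u = (codes.astype(np.int16) + 8).astype(np.uint8)    # shift to unsigned 0..15
    packed = (u[..., 0::2] << 4) | u[..., 1::2]          # two codes per byte
    return packed, scale

def pack_load(packed, scale):
    hi = (packed >> 4).astype(np.int16) - 8              # even-index codes
    lo = (packed & 0x0F).astype(np.int16) - 8            # odd-index codes
    codes = np.stack([hi, lo], axis=-1).reshape(*packed.shape[:-1], -1)
    return dequantize(codes.astype(np.int8), scale)

kv = np.random.randn(2, 16, 64).astype(np.float32)       # (heads, tokens, head_dim)
approx_a = pre_dequant_store(kv)
packed, scale = pack_store(kv)
approx_b = pack_load(packed, scale)
assert np.allclose(approx_a, approx_b)                   # both modes reconstruct the same values
assert packed.nbytes == kv.size // 2                     # 4-bit payload: half a byte per element
```

The trade-off mirrors the description above: pre-dequant keeps the hot attention path unchanged but gives up the memory savings, while packed mode keeps the cache compressed at the cost of an unpack on every read.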

Current test results for TurboQuant 4-bit are good, but TurboQuant 3-bit and 2-bit still have problems.
TurboQuant decode still happens outside attention, so HBM traffic remains too high.

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@lvliang-intel lvliang-intel changed the title from [Draft] Support to [Draft] Support TurboQuant KV-cache quantization Mar 27, 2026
@wenhuach21
Contributor

You’re moving really fast!

That said, I don’t think this feature brings clear benefits to AR at the moment. The main advantage seems to be eval accuracy with quantized kv cache. However, for Transformers, KV cache quantization at the prefill stage is difficult, which could introduce bias.

@lvliang-intel
Contributor Author

I don’t think this feature brings clear benefits to AR at the moment. The main advantage seems to be eval accuracy with quantized kv cache. However, for Transformers, KV cache quantization at the prefill stage is difficult, which could introduce bias.

You are right. I think the core value of TurboQuant lies in memory savings in serving scenarios, which translate to higher throughput.

@lvliang-intel
Contributor Author

Eval lambada_openai result
CUDA_VISIBLE_DEVICES=0 python eval_turboquant.py --model_path /mnt/disk4/lvl/Qwen3-30B-A3B-Instruct-2507/ --tasks lambada_openai --bits 2,3,4 --mode packed --residual_length 16 --batch_size 8
eval_turboquant.py

1. KV Cache Compression

| Config | KV Mem (KB) | Raw (KB) | Compression |
| --- | --- | --- | --- |
| Baseline | 24576.0 | 24576.0 | 1.00x |
| DynCache (control) | 24576.0 | 24576.0 | 1.00x |
| TQ 2b packed rl=16 | 7860.0 | 24576.0 | 6.25x |
| TQ 3b packed rl=16 | 10836.0 | 24576.0 | 4.54x |
| TQ 4b packed rl=16 | 13812.0 | 24576.0 | 3.56x |

2. Task Accuracy

| Config | lambada_openai (acc) |
| --- | --- |
| Baseline | 0.7151 |
| DynCache (control) | 0.7151 (+0.0000) |
| TQ 2b packed rl=16 | 0.4003 (-0.3148) |
| TQ 3b packed rl=16 | 0.6559 (-0.0592) |
| TQ 4b packed rl=16 | 0.7091 (-0.0060) |
