vulkan: add TurboQuant KV cache support and optimized turbo mat-vec paths #140
Open
Fenix46 wants to merge 5 commits into
Conversation
- Add dequant shaders for turbo2_0, turbo4_0, tq3_1s with WHT/RHT
- Add mul_mat_vec shader for tq3_1s
- Add flash attention support for turbo2_0, turbo3_0, turbo4_0
- Fix copy_to_quant/copy_from_quant for TurboQuant types
- Fix dequant_funcs_cm2.glsl typo (grid -> g2)
- Fix vulkan-shaders-gen: use vulkan1.3 target for _cm1/_int8/q8_1
- Add turbo2_0/turbo4_0 to FA scalar and cm1 shader generation
- Add pre-built ggml-vulkan-shaders.hpp with all new shader externs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The local_size_x = 128 WHT butterfly path in copy_to_quant.comp was gated on DATA_A_TURBO3_0 only. TURBO2_0 and TURBO4_0 fell through to the generic 32-thread path, producing an incomplete 32-of-128 WHT and corrupting the KV cache entries for those types. Fix: extend the condition to cover all three turbo types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hoist the norm/qs/signs loads out of the per-element loop in dequantize4 for all three turbo types. Since iqs is always a multiple of 4, all four elements within a dequantize4 call share the same qs byte (and the same signs byte for turbo3_0). This cuts buffer loads from 8-12 per call (4 elements x 2-3 loads each) down to 2-3, lowering cache pressure in the Q*K inner loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire up pipeline_dequant_mul_mat_vec_f32_f32, _f16_f32, and _id_f32 for GGML_TYPE_TURBO2_0/3_0/4_0 and add them to the switch in ggml_vk_get_dequantize_mul_mat_vec() and ggml_vk_get_dequantize_mul_mat_vec_id(). Without this, the decode path fell back to a slower dequantize-then-matmul route instead of the dedicated quantized kernel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
I've tested this on an RX 580 8GB and it seems to work, with a 2-3x performance gain on text ingestion. Thanks.
Summary
This PR adds Vulkan backend support for TurboQuant KV cache formats and wires the missing optimized execution paths for TurboQuant types.
The main goal is to make TurboQuant KV cache usable on Vulkan for long-context inference while avoiding unnecessary fallback paths during decode and flash-attention execution.
Included changes:
- Dequant shader support for `TURBO2_0`, `TURBO3_0`, `TURBO4_0`, `TQ3_1S`, `TQ4_1S`
- Flash-attention support for `TURBO2_0`, `TURBO3_0`, and `TURBO4_0`
- `copy_to_quant`/`copy_from_quant` handling for TurboQuant types
- `SET_ROWS` workgroup configuration for `TURBO2_0` and `TURBO4_0`
- Optimized `dequantize4()` in `flash_attn_base.glsl`
- `mul_mat_vec`/`mul_mat_vec_id` Vulkan pipelines for TurboQuant types

Motivation
Before this PR, the Vulkan TurboQuant path was incomplete for KV-cache usage.
In particular:
- `TURBO2_0` and `TURBO4_0` could fall through to the wrong `SET_ROWS` workgroup size, producing incomplete WHT output and corrupting KV cache entries;
- the decode path could miss the dedicated `mul_mat_vec` pipelines and fall back to a slower dequantize-then-matmul path.

This PR makes the Vulkan TurboQuant KV-cache path more complete, fixes correctness issues, and reduces avoidable overhead in the attention/decode hot paths.
Details
TurboQuant KV cache support
Adds Vulkan shader support and generated shader declarations for the TurboQuant KV-cache path, including dequantization, copy-to-quant, copy-from-quant, flash-attention integration, and required shader registration.
Supported internal formats in this PR:
- `TURBO2_0`
- `TURBO3_0`
- `TURBO4_0`
- `TQ3_1S`
- `TQ4_1S`

User-facing CLI cache type names:

- `turbo2`
- `turbo3`
- `turbo4`
- `tq3_1s`
- `tq4_1s`
Correctness fix for `SET_ROWS`

The WHT butterfly path in `copy_to_quant.comp` used `local_size_x = 128`, but the condition was only enabled for `DATA_A_TURBO3_0`. As a result, `TURBO2_0` and `TURBO4_0` could fall back to the generic 32-thread path, producing incomplete 32-of-128 WHT output and corrupting KV-cache entries.

This PR extends the condition to cover all three TurboQuant types, as sketched after the list below:
- `DATA_A_TURBO2_0`
- `DATA_A_TURBO3_0`
- `DATA_A_TURBO4_0`
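A minimal sketch of the shape of the fix, assuming the usual GLSL preprocessor gating in `copy_to_quant.comp` (illustrative, not a verbatim diff):

```glsl
// Workgroup size selection in copy_to_quant.comp (illustrative sketch).
// Before the fix, only DATA_A_TURBO3_0 enabled the 128-thread branch;
// TURBO2_0/TURBO4_0 silently took the generic 32-thread path, so only
// 32 of the 128 WHT lanes ran and the KV-cache entries were corrupted.
#if defined(DATA_A_TURBO2_0) || defined(DATA_A_TURBO3_0) || defined(DATA_A_TURBO4_0)
layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
#else
layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;
#endif
```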
Flash-attention dequant optimization

The TurboQuant `dequantize4()` path in `flash_attn_base.glsl` now hoists shared loads out of the per-element loop.

Since `iqs` is always a multiple of 4, the four elements handled by `dequantize4()` share the same packed `qs` byte, and for `TURBO3_0` also the same signs byte. This reduces repeated buffer loads in the Q*K inner loop and lowers cache pressure during flash attention.
Dedicated Vulkan mat-vec pipelines
This PR registers dedicated Vulkan `mul_mat_vec` and `mul_mat_vec_id` pipelines for:

- `TURBO2_0`
- `TURBO3_0`
- `TURBO4_0`

Without this, the decode path could miss the optimized quantized kernels and fall back to a slower dequantize-then-matmul route.
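The registration boils down to extending the type switch so TurboQuant types resolve to a real pipeline instead of the `nullptr` fallback; a simplified sketch (the actual function in `ggml-vulkan.cpp` takes more parameters, and the `_id_f32` variant is wired the same way):

```cpp
// Simplified sketch, not the literal ggml-vulkan.cpp signature: route the
// three TurboQuant types to their dedicated mat-vec pipelines instead of
// returning nullptr, which forced the dequantize-then-matmul fallback.
static vk_pipeline ggml_vk_get_dequantize_mul_mat_vec(
        ggml_backend_vk_context * ctx, ggml_type a_type, ggml_type b_type) {
    switch (a_type) {
        // ... existing quantized types ...
        case GGML_TYPE_TURBO2_0:   // newly wired
        case GGML_TYPE_TURBO3_0:   // newly wired
        case GGML_TYPE_TURBO4_0:   // newly wired
            break;
        default:
            return nullptr;        // unsupported: caller takes the slow path
    }
    // Pipeline names from the commit message; indexing is illustrative.
    return b_type == GGML_TYPE_F16
        ? ctx->device->pipeline_dequant_mul_mat_vec_f16_f32[a_type]
        : ctx->device->pipeline_dequant_mul_mat_vec_f32_f32[a_type];
}
```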
Testing
Tested locally with the Vulkan backend and TurboQuant KV cache enabled, on an AMD RX 570 8GB.
It works perfectly; I haven't noticed any token generation issues on any of the Turbo variants. However, without native FP16 on the card, I can't fully validate the Turbo4 tests.
I can also add that Turbo2 is faster even though everything is computed in FP32.
Suggested build test:
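A minimal sketch, assuming a llama.cpp-style build and CLI (the model path is a placeholder, and the exact flag spellings should be checked against this tree):

```bash
# Hedged sketch: rebuild the Vulkan backend, then smoke-test every
# TurboQuant cache type on a short generation. --flash-attn exercises
# the FA paths added in this PR.
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

for ct in turbo2 turbo3 turbo4 tq3_1s tq4_1s; do
    ./build/bin/llama-cli -m model.gguf -p "Hello" -n 32 \
        --cache-type-k "$ct" --cache-type-v "$ct" --flash-attn
done
```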