Problem
INT8 GEMM outputs int32 accumulators that must be converted back to bf16 with a scale factor. The existing dequant_i32_bf16 (#96) operator uses per-group packed buffer formats that don't compose directly with GEMM output. CPU-side dequantization requires an expensive NPU↔CPU round trip on every GEMM call (16MB int32 download plus 8MB bf16 upload).
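The round-trip figures follow from the element widths alone: int32 is 4 bytes and bf16 is 2 bytes. A quick sanity check, assuming "MB" means MiB and a hypothetical output of 4Mi elements (for example a 2048 × 2048 GEMM result; the actual shape is not stated here):

```python
MIB = 1024 * 1024


def round_trip_bytes(num_elements: int) -> tuple[int, int]:
    """Bytes moved by CPU-side dequantization for one GEMM output."""
    download = num_elements * 4  # int32 accumulators, NPU -> CPU
    upload = num_elements * 2    # bf16 results, CPU -> NPU
    return download, upload


# Hypothetical element count chosen to match the 16MB / 8MB figures.
down, up = round_trip_bytes(4 * MIB)
print(down / MIB, up / MIB)  # 16.0 8.0
```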
Solution
New scale_i32_bf16 operator that takes:
- Input 1: plain (size,) int32 buffer, directly from GEMM output
- Input 2: tiny (num_cores × 16,) bf16 scale buffer (~256 bytes)
- Output: plain (size,) bf16 buffer
No packed formats are involved. The scale ObjectFIFO is acquired once per core and reused across all tile iterations. The operator beats CPU-side dequantization at prompt lengths of roughly 3500 tokens and above.
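For reviewers, the intended semantics can be sketched as a NumPy reference. This is not the kernel code, and several details are assumptions made only for illustration: each core is assumed to process one contiguous chunk of the input, the first of each core's 16 scale entries is assumed to hold that core's scale, and bf16 is emulated as uint16 bit patterns because NumPy has no native bf16 type. The function names are hypothetical.

```python
import numpy as np


def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Round float32 to bf16 (nearest-even); returns uint16 bit patterns.

    NaN inputs are not handled specially in this sketch.
    """
    bits = x.astype(np.float32).view(np.uint32)
    # Add 0x7FFF plus the lowest kept bit, then drop the low 16 bits.
    rounding = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + rounding) >> np.uint32(16)).astype(np.uint16)


def bf16_bits_to_f32(b: np.ndarray) -> np.ndarray:
    """Widen bf16 bit patterns back to float32 for inspection."""
    return (b.astype(np.uint32) << np.uint32(16)).view(np.float32)


def scale_i32_bf16_reference(acc: np.ndarray, scales: np.ndarray,
                             num_cores: int) -> np.ndarray:
    """Reference for output = bf16(int32 accumulator * per-core scale)."""
    assert acc.dtype == np.int32 and acc.ndim == 1
    assert scales.shape == (num_cores * 16,)
    assert acc.size % num_cores == 0
    # Assumption: core i handles the i-th contiguous chunk of the buffer.
    chunks = acc.reshape(num_cores, -1).astype(np.float32)
    # Assumption: entry 0 of each core's 16-wide slot is that core's scale.
    per_core = scales.reshape(num_cores, 16)[:, :1].astype(np.float32)
    return f32_to_bf16_bits((chunks * per_core).reshape(-1))


acc = np.array([0, 1, -2, 256, 1000, -1000, 3, 4], dtype=np.int32)
scales = np.repeat(np.array([0.5, 2.0], dtype=np.float32), 16)
out = scale_i32_bf16_reference(acc, scales, num_cores=2)
print(bf16_bits_to_f32(out).tolist())
# [0.0, 0.5, -1.0, 128.0, 2000.0, -2000.0, 6.0, 8.0]
```

A reference like this is mainly useful for checking NPU output in tests; it says nothing about how the kernel tiles the data or moves it through the ObjectFIFOs.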
Tests
7 non-extensive parameter combinations, all passing. Configurations: 1-8 columns, 1-2 channels, tile sizes 256-8192.