Add scale_i32_bf16 operator

### Problem
  INT8 GEMM produces int32 accumulators that must be converted back to bf16 with a scale factor. The existing `dequant_i32_bf16` operator (#96) uses per-group packed buffer formats that don't compose directly with GEMM output. Dequantizing on the CPU instead requires expensive NPU↔CPU round trips (16 MB i32 download + 8 MB bf16 upload per GEMM call).
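
For scale, the transfer sizes quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes "MB" means MiB and that both buffers hold the same number of elements, and the variable names are illustrative rather than taken from the codebase.

```python
# Assumption: "MB" in the issue text means MiB (1024 * 1024 bytes).
MIB = 1024 * 1024
I32_BYTES = 4    # size of one int32 accumulator
BF16_BYTES = 2   # size of one bf16 value

download_bytes = 16 * MIB  # int32 GEMM output copied NPU -> CPU
upload_bytes = 8 * MIB     # bf16 result copied CPU -> NPU

# Both figures describe the same number of elements.
elements = download_bytes // I32_BYTES
assert elements == upload_bytes // BF16_BYTES

round_trip_bytes = download_bytes + upload_bytes
print(elements, round_trip_bytes // MIB)
```

Under these assumptions each GEMM call moves about 4.2 million elements and 24 MiB in total across the NPU↔CPU boundary.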

### Solution
  Add a new `scale_i32_bf16` operator that takes:
  - **Input 1**: plain `(size,)` int32 buffer, directly from GEMM output
  - **Input 2**: tiny `(num_cores × 16,)` bf16 scale buffer (~256 bytes)
  - **Output**: plain `(size,)` bf16 buffer

  No packed formats are involved. The scale ObjectFIFO is acquired once per core and reused across all tile iterations. The operator outperforms CPU dequantization at prompt lengths of roughly 3500 tokens and above.
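
The intended numerics can be sketched as a host-side reference in pure Python. This is not the kernel: it assumes the operator converts each int32 to float32, multiplies by one bf16-representable scale, and rounds to bf16 with round-to-nearest-even. All function names here are hypothetical, and the 8-core figure is inferred from the ~256-byte scale buffer, not stated in the issue.

```python
import struct

def round_f32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def f32_to_bf16_bits(x: float) -> int:
    """Convert to bf16 with round-to-nearest-even; returns the 16-bit pattern.
    NaN/Inf handling is omitted because the inputs here are finite."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even mantissa
    return ((bits + bias) >> 16) & 0xFFFF

def bf16_bits_to_f32(b: int) -> float:
    """Widen a bf16 bit pattern back to a float for inspection."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

def scale_i32_bf16_ref(acc, scale):
    """Hypothetical reference: bf16(float32(acc[i]) * scale) per element.
    `scale` is assumed to be exactly representable in bf16."""
    return [f32_to_bf16_bits(round_f32(float(a)) * scale) for a in acc]

def scale_buffer_bytes(num_cores: int) -> int:
    """Size of a (num_cores * 16,) bf16 buffer in bytes."""
    return num_cores * 16 * 2

out = scale_i32_bf16_ref([0, 1, -2, 256, 1000], 0.5)
print([bf16_bits_to_f32(b) for b in out])
print(scale_buffer_bytes(8))  # assumption: 8 cores
```

A reference like this could serve as the expected-value generator when comparing NPU output against the CPU, provided the kernel's rounding mode matches the assumption above.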

### Tests
  7 non-extensive parameter combinations, all passing. The configurations cover 1-8 columns, 1-2 channels, and tile sizes 256-8192.

Add scale_i32_bf16 operator #99

Problem

Solution

Tests

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add scale_i32_bf16 operator #99

Description

Problem

Solution

Tests

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions