Add cast pushdown optimization for bit-packed integer widening #8046

Open
joseph-isaacs wants to merge 8 commits into develop from claude/cast-bitpacked-pushdown-VNtVh

@joseph-isaacs
Contributor

@joseph-isaacs joseph-isaacs commented May 21, 2026

Summary

This PR implements a "cast pushdown" optimization for widening casts on bit-packed integer columns (e.g., u16 -> u32). Rather than canonicalizing to a full-length intermediate array and then casting it, the optimization unpacks each FastLanes chunk into a cache-resident scratch buffer and casts values directly into the output buffer during decompression.

Running the benchmark locally, I get:

  ┌────────────────────┬─────────────────┬──────────────────────┬─────────┐
  │ Case               │ Old public path │ Current cast_execute │ Speedup │
  ├────────────────────┼─────────────────┼──────────────────────┼─────────┤
  │ (65536, 1, 0.0)    │        71.68 us │             6.895 us │   10.4x │
  │ (65536, 16, 0.01)  │        2.229 ms │             423.4 us │    5.3x │
  │ (1048576, 1, 0.01) │        2.387 ms │             337.6 us │    7.1x │
  └────────────────────┴─────────────────┴──────────────────────┴─────────┘

claude added 2 commits May 21, 2026 11:30
Widening a bit-packed narrow integer column to a wider type (e.g. u16 ->
u32) currently has no cast pushdown: cast(bit_packed) canonicalizes to a
full-length narrow PrimitiveArray and then casts it, allocating two
full-length buffers and round-tripping the narrow intermediate through
RAM.

Add `BitUnpackedChunks::decode_cast_into`, which unpacks each
1024-element FastLanes chunk into the existing cache-resident scratch
buffer and maps each value through a closure into a differently-typed
output, plus `unpack_and_cast_into_builder` which uses it to unpack
straight into a wide PrimitiveBuilder (handling validity and patches).
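
The chunk-at-a-time idea described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `pack`, `unpack_one`, and `unpack_and_cast` are hypothetical names, the packing is a plain sequential LSB-first layout rather than the FastLanes transposed layout, and validity and patches are ignored. The point it tries to show is that only a fixed 1024-element narrow scratch buffer is live at any time, and the wide output is written directly.

```rust
/// Values per chunk; the PR works on 1024-element FastLanes chunks.
const CHUNK: usize = 1024;

/// Hypothetical helper: pack `values` at `width` bits each, LSB-first,
/// into u64 words. Sequential layout, NOT the FastLanes layout.
fn pack(values: &[u16], width: u32) -> Vec<u64> {
    assert!((1..=16).contains(&width));
    let w = width as usize;
    let mut words = vec![0u64; (values.len() * w).div_ceil(64)];
    for (i, &v) in values.iter().enumerate() {
        let (word, shift) = ((i * w) / 64, (i * w) % 64);
        let v = (v as u64) & ((1u64 << width) - 1);
        words[word] |= v << shift;
        if shift + w > 64 {
            // The value straddles two words; spill the high bits.
            words[word + 1] |= v >> (64 - shift);
        }
    }
    words
}

/// Hypothetical helper: read the `i`-th `width`-bit value back out.
fn unpack_one(words: &[u64], width: u32, i: usize) -> u16 {
    let w = width as usize;
    let (word, shift) = ((i * w) / 64, (i * w) % 64);
    let mut v = words[word] >> shift;
    if shift + w > 64 {
        v |= words[word + 1] << (64 - shift);
    }
    (v & ((1u64 << width) - 1)) as u16
}

/// Unpack chunk by chunk into a small narrow scratch buffer, then map each
/// value through `cast` straight into the wide output. No full-length
/// narrow intermediate is ever allocated.
fn unpack_and_cast<T>(
    words: &[u64],
    width: u32,
    len: usize,
    cast: impl Fn(u16) -> T,
) -> Vec<T> {
    let mut out = Vec::with_capacity(len);
    let mut scratch = [0u16; CHUNK]; // small enough to stay cache-resident
    let mut start = 0;
    while start < len {
        let n = CHUNK.min(len - start);
        for (j, slot) in scratch[..n].iter_mut().enumerate() {
            *slot = unpack_one(words, width, start + j);
        }
        out.extend(scratch[..n].iter().map(|&v| cast(v)));
        start += n;
    }
    out
}
```

With this sketch, a u16 -> u32 widening would be `unpack_and_cast(&packed, width, len, u32::from)`; the closure parameter is what lets one decode loop serve any target type.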

Add a divan benchmark (cast_bitpacked) comparing the current
canonicalize-then-cast path against the pushdown, over single and
chunked arrays, with and without patches.

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>

Extend BitPacked's CastKernel so that widening integer casts (e.g. u16 ->
u32) dispatch to the unpack-and-cast pushdown automatically, instead of
falling back to canonicalize-then-cast. The cast is gated to strictly
wider integer targets where every bit-packable value is representable
(unsigned source, or signed-to-signed), so no per-value bounds check is
needed.
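
One way to read the gating rule above is as a small predicate over source and target integer types. The sketch below is an assumption about the shape of that check, not the kernel's actual code: `IntType` and `is_pushdown_eligible` are hypothetical names introduced for illustration.

```rust
/// Hypothetical stand-in for the integer primitive types involved.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum IntType {
    U8,
    U16,
    U32,
    U64,
    I8,
    I16,
    I32,
    I64,
}

impl IntType {
    fn bit_width(self) -> u32 {
        match self {
            IntType::U8 | IntType::I8 => 8,
            IntType::U16 | IntType::I16 => 16,
            IntType::U32 | IntType::I32 => 32,
            IntType::U64 | IntType::I64 => 64,
        }
    }

    fn is_signed(self) -> bool {
        matches!(
            self,
            IntType::I8 | IntType::I16 | IntType::I32 | IntType::I64
        )
    }
}

/// True when every source value is representable in the target, so the
/// cast needs no per-value bounds check:
/// - the target must be strictly wider than the source, and
/// - the source is unsigned (fits in any wider type, signed or not),
///   or both source and target are signed.
fn is_pushdown_eligible(src: IntType, dst: IntType) -> bool {
    let strictly_wider = dst.bit_width() > src.bit_width();
    let sign_compatible = !src.is_signed() || dst.is_signed();
    strictly_wider && sign_compatible
}
```

Under this reading, u16 -> u32, u16 -> i32, and i16 -> i32 are eligible, while i16 -> u32 (a signed source with an unsigned target) and any same-width or narrowing cast fall back to the existing path.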

Update the cast_bitpacked benchmark to measure the real
array.cast(u32).execute() path alongside an explicit canonicalize-then-cast
baseline and the direct helper.

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
@codspeed-hq

codspeed-hq Bot commented May 21, 2026

Merging this PR will improve performance by 19.8%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
✅ 1236 untouched benchmarks

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_varbinview_opt_canonical_into[(1000, 10)] 225.1 µs 187.9 µs +19.8%

Comparing claude/cast-bitpacked-pushdown-VNtVh (dde5949) with develop (f852d72)


claude and others added 2 commits May 21, 2026 13:12
Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
u
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs requested review from a team and robert3005 May 21, 2026 14:39
joseph-isaacs and others added 4 commits May 21, 2026 16:35
u
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
u
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Generalize apply_patches_to_uninit_range_fn to a cross-type Fn(S) -> T so
the cast pushdown reuses it instead of a near-identical copy, and drop the
redundant identity wrapper. Behaviour and performance are unchanged.
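
For readers unfamiliar with the refactor, the general shape of a patch-application helper that is generic over a cross-type `Fn(S) -> T` might look like the sketch below. It is a simplified assumption: `apply_patches_with` is a hypothetical name, and it writes into an initialized `&mut [T]` rather than the uninitialized range the real function operates on.

```rust
/// Overwrite `out[indices[k]]` with `f(values[k])` for every patch.
/// `S` is the narrow patch value type and `T` the output element type;
/// same-type callers can pass the identity closure `|v| v`.
fn apply_patches_with<S: Copy, T>(
    out: &mut [T],
    indices: &[usize],
    values: &[S],
    f: impl Fn(S) -> T,
) {
    assert_eq!(indices.len(), values.len());
    for (&idx, &v) in indices.iter().zip(values) {
        out[idx] = f(v);
    }
}
```

A widening caller would pass a conversion such as `u32::from`, and a same-type caller would pass `|v| v`, which is why a separate identity wrapper becomes unnecessary.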

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>

Replace the direct-kernel and direct-helper cast tests with a single
end-to-end test that drives array.cast(target).execute(), proving the
public Vortex path dispatches to BitPacked's widening pushdown across
all supported integer pairs, chunk-boundary lengths, and a sliced case.

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs added the action/benchmark Trigger full benchmarks to run on this PR label May 21, 2026
@github-actions github-actions Bot removed the action/benchmark Trigger full benchmarks to run on this PR label May 21, 2026