Add cast pushdown optimization for bit-packed integer widening #8046
Open
joseph-isaacs wants to merge 8 commits into
Widening a bit-packed narrow integer column to a wider type (e.g. u16 -> u32) currently has no cast pushdown: `cast(bit_packed)` canonicalizes to a full-length narrow `PrimitiveArray` and then casts it, allocating two full-length buffers and round-tripping the narrow intermediate through RAM.

Add `BitUnpackedChunks::decode_cast_into`, which unpacks each 1024-element FastLanes chunk into the existing cache-resident scratch buffer and maps each value through a closure into a differently-typed output, plus `unpack_and_cast_into_builder`, which uses it to unpack straight into a wide `PrimitiveBuilder` (handling validity and patches).

Add a divan benchmark (`cast_bitpacked`) comparing the current canonicalize-then-cast path against the pushdown, over single and chunked arrays, with and without patches.

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
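The chunk-wise unpack-and-cast idea described above can be sketched as follows. This is a minimal illustration, not the PR's code: it assumes a plain sequential LSB-first bit layout rather than the transposed FastLanes layout, and the names `pack`, `get`, and `decode_cast_into_sketch` are hypothetical. The point it shows is that only one 1024-element narrow scratch buffer ever exists; the full-length narrow intermediate is never materialized.

```rust
/// Values decoded per chunk (the PR describes 1024-element FastLanes chunks).
const CHUNK: usize = 1024;

/// Pack `values` using `width` bits each, sequentially and LSB-first.
/// ASSUMPTION: simplified layout for illustration, NOT the FastLanes layout.
fn pack(values: &[u16], width: u32) -> Vec<u64> {
    assert!((1..=16).contains(&width));
    let total_bits = values.len() * width as usize;
    let mut words = vec![0u64; (total_bits + 63) / 64];
    for (i, &v) in values.iter().enumerate() {
        let bit = i * width as usize;
        let (word, shift) = (bit / 64, bit % 64);
        words[word] |= (v as u64) << shift;
        // The value straddles a word boundary: spill the high bits.
        if shift + width as usize > 64 {
            words[word + 1] |= (v as u64) >> (64 - shift);
        }
    }
    words
}

/// Read the `i`-th `width`-bit value back out of `words`.
fn get(words: &[u64], width: u32, i: usize) -> u16 {
    let bit = i * width as usize;
    let (word, shift) = (bit / 64, bit % 64);
    let mut v = words[word] >> shift;
    if shift + width as usize > 64 {
        v |= words[word + 1] << (64 - shift);
    }
    (v & ((1u64 << width) - 1)) as u16
}

/// Hypothetical stand-in for the pushdown: unpack one chunk at a time into a
/// small reusable scratch buffer, then map each value through `cast` straight
/// into the wide output.
fn decode_cast_into_sketch<T>(
    words: &[u64],
    width: u32,
    len: usize,
    cast: impl Fn(u16) -> T,
    out: &mut Vec<T>,
) {
    let mut scratch = [0u16; CHUNK]; // small enough to stay cache-resident
    out.reserve(len);
    let mut start = 0;
    while start < len {
        let n = CHUNK.min(len - start);
        for (j, slot) in scratch[..n].iter_mut().enumerate() {
            *slot = get(words, width, start + j);
        }
        out.extend(scratch[..n].iter().map(|&v| cast(v)));
        start += n;
    }
}
```

Calling `decode_cast_into_sketch(&words, 10, len, |v| v as u32, &mut out)` on 10-bit packed data fills `out` with `u32` values while touching only the packed input, the scratch buffer, and the final output.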
Extend BitPacked's `CastKernel` so that widening integer casts (e.g. u16 -> u32) dispatch to the unpack-and-cast pushdown automatically, instead of falling back to canonicalize-then-cast. The cast is gated to strictly wider integer targets where every bit-packable value is representable (unsigned source, or signed-to-signed), so no per-value bounds check is needed.

Update the `cast_bitpacked` benchmark to measure the real `array.cast(u32).execute()` path alongside an explicit canonicalize-then-cast baseline and the direct helper.

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
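One way to read the gating rule in this commit message is as a predicate over the source and target integer types. The sketch below is an interpretation of the prose only; `IntType` and `is_lossless_widening` are hypothetical names and do not come from the Vortex codebase.

```rust
/// Hypothetical integer type tag, used only for this illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum IntType {
    U8,
    U16,
    U32,
    U64,
    I8,
    I16,
    I32,
    I64,
}

impl IntType {
    /// Width of the type in bits.
    fn bits(self) -> u32 {
        match self {
            IntType::U8 | IntType::I8 => 8,
            IntType::U16 | IntType::I16 => 16,
            IntType::U32 | IntType::I32 => 32,
            IntType::U64 | IntType::I64 => 64,
        }
    }

    fn is_signed(self) -> bool {
        matches!(
            self,
            IntType::I8 | IntType::I16 | IntType::I32 | IntType::I64
        )
    }
}

/// True when every value of `src` fits in `dst` without a per-value check:
/// the target is strictly wider, and the source is unsigned or both are signed.
/// ASSUMPTION: this encodes the commit message's wording, not the PR's code.
fn is_lossless_widening(src: IntType, dst: IntType) -> bool {
    dst.bits() > src.bits() && (!src.is_signed() || dst.is_signed())
}
```

Under this reading, u16 -> u32 and i16 -> i64 take the pushdown, while i16 -> u32 (a negative source value has no unsigned representation) and same-width or narrowing casts would fall back to the existing path.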
Merging this PR will improve performance by 19.8%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] | 225.1 µs | 187.9 µs | +19.8% |
Comparing claude/cast-bitpacked-pushdown-VNtVh (dde5949) with develop (f852d72)
Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
Generalize `apply_patches_to_uninit_range_fn` to a cross-type `Fn(S) -> T` so the cast pushdown reuses it instead of a near-identical copy, and drop the redundant identity wrapper. Behaviour and performance are unchanged.

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
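The shape of that generalization can be illustrated with a small sketch. It assumes patches are given as parallel slices of absolute indices and source-typed values, and it writes into an already-initialized slice for simplicity, whereas the function in the PR operates on an uninitialized range. The name `apply_patches_to_range` and its signature are hypothetical.

```rust
/// Overwrite the positions of `out` named by `patch_indices`, converting each
/// patch value from the source type `S` to the output type `T` through `f`.
/// `out` covers the absolute index range `[range_start, range_start + out.len())`;
/// patches outside that range are ignored.
/// ASSUMPTION: simplified signature; the PR's version targets uninitialized memory.
fn apply_patches_to_range<S: Copy, T>(
    out: &mut [T],
    range_start: usize,
    patch_indices: &[usize],
    patch_values: &[S],
    f: impl Fn(S) -> T,
) {
    assert_eq!(patch_indices.len(), patch_values.len());
    let range_end = range_start + out.len();
    for (&idx, &value) in patch_indices.iter().zip(patch_values) {
        if idx >= range_start && idx < range_end {
            out[idx - range_start] = f(value);
        }
    }
}
```

With a single generic function, the same-type path passes the identity closure `|v| v` and the cast pushdown passes a widening closure such as `|v| u32::from(v)`, so no separate copy of the patching loop is needed.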
Replace the direct-kernel and direct-helper cast tests with a single end-to-end test that drives `array.cast(target).execute()`, proving the public Vortex path dispatches to BitPacked's widening pushdown across all supported integer pairs, chunk-boundary lengths, and a sliced case.

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
Summary
This PR implements a "cast pushdown" optimization for widening casts on bit-packed integer columns (e.g., u16 -> u32). Rather than canonicalizing to a full-length intermediate array and then casting it, the optimization unpacks each FastLanes chunk into a cache-resident scratch buffer and casts values directly into the output buffer during decompression.

Running locally I get