fastlanes: streaming compare + between kernels for BitPacked #8015
joseph-isaacs wants to merge 1 commit into
Adds `CompareKernel` and `BetweenKernel` for `BitPacked` that walk the
encoded array one 1024-element FastLanes block at a time through a single
reused scratch buffer, splice any `Patches` into the block in place via a
sorted-index cursor, then fold a `Fn(T) -> bool` predicate over the block
and write the bits directly into the output bit buffer. The materialised
primitive never appears.
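The block-at-a-time walk described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Patches` struct here is a simplified stand-in (sorted absolute indices plus replacement values), the signature of `stream_predicate` is assumed, and `unpack_block` is a hypothetical closure that stands in for the real FastLanes unpacking routine.

```rust
/// FastLanes block size, as described in the PR text.
const BLOCK: usize = 1024;

/// Simplified stand-in for `Patches` (assumption): sorted absolute
/// indices and the values that override the unpacked ones.
pub struct Patches<T> {
    pub indices: Vec<usize>,
    pub values: Vec<T>,
}

/// Evaluate `pred` over `len` elements, one block at a time.
/// `unpack_block(block_idx, dst)` is a hypothetical decoder that fills
/// `dst` with the decoded values of block `block_idx`.
/// Returns the result bits packed LSB-first into `u64` words.
pub fn stream_predicate<T: Copy + Default>(
    len: usize,
    mut unpack_block: impl FnMut(usize, &mut [T]),
    patches: Option<&Patches<T>>,
    pred: impl Fn(T) -> bool,
) -> Vec<u64> {
    let mut out = vec![0u64; (len + 63) / 64];
    // Single scratch buffer, reused for every block.
    let mut scratch = vec![T::default(); BLOCK];
    // Cursor into the sorted patch indices; it only moves forward.
    let mut cursor = 0usize;
    let mut start = 0usize;
    while start < len {
        let n = BLOCK.min(len - start);
        unpack_block(start / BLOCK, &mut scratch[..n]);
        // Splice every patch that falls inside this block, in place.
        if let Some(p) = patches {
            while cursor < p.indices.len() && p.indices[cursor] < start + n {
                scratch[p.indices[cursor] - start] = p.values[cursor];
                cursor += 1;
            }
        }
        // Fold the predicate over the block and write bits directly
        // into the output words; no full-length value array exists.
        for (i, &v) in scratch[..n].iter().enumerate() {
            let pos = start + i;
            out[pos / 64] |= (pred(v) as u64) << (pos % 64);
        }
        start += n;
    }
    out
}
```

Because the block size is a multiple of 64, every block starts on a word boundary of the output, and the only per-call allocations are the scratch buffer and the output words.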
The inner predicate-fold matches the canonical `BitBuffer::collect_bool`
shape — pack 64 bools into a `u64` in a tight loop — which rustc
auto-vectorises into the same `pcmpeq` + `psllq` (vector shift to bit
position) + `por` (OR into accumulator) pattern that `arrow-ord::apply_op`
lowers to. Verified via `objdump` on the bench binary (344 monomorphised
`stream_predicate` variants emit those SIMD instructions in the inner loop).
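For readers unfamiliar with that shape, here is a minimal sketch of packing 64 bools into a `u64` in a tight loop. It is written in the spirit of `collect_bool` but is not the actual `BitBuffer` implementation; the free function and its signature are assumptions made for illustration.

```rust
/// Pack `len` predicate results into 64-bit words, LSB-first.
/// `f(i)` returns the predicate result for element `i`.
/// Illustrative only; not the real `BitBuffer::collect_bool`.
pub fn collect_bool(len: usize, mut f: impl FnMut(usize) -> bool) -> Vec<u64> {
    let mut words = Vec::with_capacity((len + 63) / 64);
    let full_chunks = len / 64;
    for chunk in 0..full_chunks {
        let mut packed = 0u64;
        // Fixed trip count of 64 with a branch-free body: the form
        // the compiler has the best chance of vectorising.
        for bit in 0..64 {
            packed |= (f(chunk * 64 + bit) as u64) << bit;
        }
        words.push(packed);
    }
    // Trailing partial word, if `len` is not a multiple of 64.
    let remainder = len % 64;
    if remainder != 0 {
        let mut packed = 0u64;
        for bit in 0..remainder {
            packed |= (f(full_chunks * 64 + bit) as u64) << bit;
        }
        words.push(packed);
    }
    words
}
```

Whether a given loop of this form actually vectorises depends on the element type, the predicate, and the target features, which is why the PR checks the emitted instructions with `objdump` rather than assuming it.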
Smallest possible diff: only adds the two kernels and a private helper
shared between them, no benches, no public-API expansion beyond the two
trait impls.
encodings/fastlanes/public-api.lock | 8 +
encodings/fastlanes/src/bitpacking/compute/between.rs | 248 +
encodings/fastlanes/src/bitpacking/compute/compare.rs | 187 +
encodings/fastlanes/src/bitpacking/compute/mod.rs | 3 +
encodings/fastlanes/src/bitpacking/compute/stream_predicate.rs | 211 +
encodings/fastlanes/src/bitpacking/vtable/kernels.rs | 4 +
6 files changed, 661 insertions(+)
Checks:
- cargo nextest run -p vortex-fastlanes (278 passed)
- cargo clippy -p vortex-fastlanes --all-targets --all-features
- cargo +nightly fmt -p vortex-fastlanes --check
- ./scripts/public-api.sh (only adds the two new trait impls)
Signed-off-by: Claude <noreply@anthropic.com>
Benchmarks: PolarSignals Profiling
Vortex (geomean): 1.052x ➖
datafusion / vortex-file-compressed (1.052x ➖, 0↑ 3↓)
File Sizes: PolarSignals Profiling
No file size changes detected.
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.968x ➖, 1↑ 0↓)
datafusion / vortex-compact (0.982x ➖, 0↑ 0↓)
datafusion / parquet (0.975x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.963x ➖, 2↑ 0↓)
duckdb / vortex-compact (1.001x ➖, 0↑ 0↓)
duckdb / parquet (0.971x ➖, 1↑ 0↓)
File Sizes: FineWeb NVMe
No file size changes detected.
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.043x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.038x ➖, 0↑ 1↓)
datafusion / parquet (0.993x ➖, 2↑ 0↓)
datafusion / arrow (1.052x ➖, 0↑ 4↓)
duckdb / vortex-file-compressed (0.980x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.001x ➖, 0↑ 0↓)
duckdb / parquet (0.992x ➖, 2↑ 1↓)
duckdb / duckdb (0.999x ➖, 0↑ 0↓)
File Sizes: TPC-H SF=1 on NVME
File Size Changes (195 files changed, -98.4% overall, 0↑ 195↓)
Merging this PR will improve performance by 32.36%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 197.9 µs | 162 µs | +22.19% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(100, 100)] | 358.4 µs | 323.5 µs | +10.78% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(1000, 10)] | 211.2 µs | 175.8 µs | +20.11% |
| ⚡ | Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] | 224.8 µs | 188.6 µs | +19.23% |
| ⚡ | Simulation | fast_lt_out_of_range[16, 1024] | 67.8 µs | 31 µs | ×2.2 |
| ⚡ | Simulation | fast_lt_out_of_range[4, 1024] | 87.5 µs | 37.2 µs | ×2.4 |
| ⚡ | Simulation | baseline_lt[4, 65536] | 251.9 µs | 201 µs | +25.32% |
| ⚡ | Simulation | fast_eq_out_of_range[4, 1024] | 67 µs | 30.4 µs | ×2.2 |
| ⚡ | Simulation | fast_lt_out_of_range[4, 65536] | 262 µs | 109.3 µs | ×2.4 |
| ⚡ | Simulation | new_alp_prim_test_between[f32, 2048] | 62.1 µs | 53.1 µs | +16.9% |
| ❌ | Simulation | baseline_lt[4, 1024] | 64.1 µs | 78.8 µs | -18.64% |
| ⚡ | Simulation | fast_eq_out_of_range[16, 1024] | 67.7 µs | 31.1 µs | ×2.2 |
| ⚡ | Simulation | fast_eq_out_of_range[4, 65536] | 246 µs | 86.9 µs | ×2.8 |
| ⚡ | Simulation | fast_eq_out_of_range[16, 65536] | 291.1 µs | 137.4 µs | ×2.1 |
| ⚡ | Simulation | fast_lt_out_of_range[16, 65536] | 306.3 µs | 126.3 µs | ×2.4 |
| ⚡ | Simulation | baseline_eq[16, 65536] | 259.4 µs | 229.9 µs | +12.8% |
| ⚡ | Simulation | baseline_eq[4, 65536] | 237.9 µs | 185.1 µs | +28.48% |
| ⚡ | Simulation | baseline_lt[16, 65536] | 274.5 µs | 217.7 µs | +26.07% |
| ❌ | Simulation | new_alp_prim_test_between[f32, 32768] | 153.2 µs | 200.9 µs | -23.72% |
| ⚡ | Simulation | new_alp_prim_test_between[f64, 32768] | 250.4 µs | 208.5 µs | +20.1% |
| ... | ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Comparing claude/fastlane-compare-kernel-7slGu (ee44dd6) with develop (7b47788)
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.052x ➖, 2↑ 9↓)
datafusion / vortex-compact (1.059x ➖, 0↑ 9↓)
datafusion / parquet (1.051x ➖, 0↑ 12↓)
duckdb / vortex-file-compressed (1.051x ➖, 1↑ 21↓)
duckdb / vortex-compact (1.048x ➖, 0↑ 11↓)
duckdb / parquet (1.039x ➖, 0↑ 8↓)
duckdb / duckdb (1.056x ➖, 0↑ 21↓)
File Sizes: TPC-DS SF=1 on NVME
No file size changes detected.
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.049x ➖, 0↑ 1↓)
datafusion / vortex-compact (0.821x ➖, 1↑ 0↓)
datafusion / parquet (0.976x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.996x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.073x ➖, 0↑ 0↓)
duckdb / parquet (0.990x ➖, 0↑ 0↓)
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.992x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.005x ➖, 0↑ 0↓)
duckdb / parquet (1.011x ➖, 0↑ 0↓)
File Sizes: Statistical and Population Genetics
No file size changes detected.
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.980x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.987x ➖, 0↑ 0↓)
datafusion / parquet (0.985x ➖, 0↑ 0↓)
datafusion / arrow (0.966x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.983x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.994x ➖, 0↑ 0↓)
duckdb / parquet (0.990x ➖, 0↑ 0↓)
duckdb / duckdb (0.997x ➖, 0↑ 0↓)
File Sizes: TPC-H SF=10 on NVME
No file size changes detected.
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.944x ➖, 4↑ 0↓)
datafusion / parquet (0.941x ➖, 3↑ 0↓)
duckdb / vortex-file-compressed (0.931x ➖, 8↑ 0↓)
duckdb / parquet (0.956x ➖, 4↑ 0↓)
duckdb / duckdb (0.956x ➖, 4↑ 0↓)
File Sizes: Clickbench on NVME
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Benchmarks: Random Access
Vortex (geomean): 0.850x ✅
unknown / unknown (0.879x ✅, 21↑ 0↓)
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.060x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.052x ➖, 0↑ 0↓)
datafusion / parquet (1.013x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (0.996x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.976x ➖, 0↑ 0↓)
duckdb / parquet (1.001x ➖, 0↑ 0↓)
Benchmarks: Compression
Vortex (geomean): 0.997x ➖
unknown / unknown (0.997x ➖, 2↑ 3↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.923x ➖, 2↑ 0↓)
datafusion / vortex-compact (0.893x ➖, 2↑ 0↓)
datafusion / parquet (0.975x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.906x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.956x ➖, 0↑ 0↓)
duckdb / parquet (0.867x ➖, 0↑ 0↓)