Skip to content

Add UncompressedSizeInBytes pushdown for many encodings#7953

Open
AdamGS wants to merge 5 commits into
developfrom
adamg/all-uncompressed-size-fns
Open

Add UncompressedSizeInBytes pushdown for many encodings#7953
AdamGS wants to merge 5 commits into
developfrom
adamg/all-uncompressed-size-fns

Conversation

@AdamGS
Copy link
Copy Markdown
Contributor

@AdamGS AdamGS commented May 15, 2026

Summary

This PR introduces pushdown of UncompressedSizeInBytes for a bunch of encodings, much of it is using a shared kernel for fixed-sized types that is applicable in a ton of places.

@AdamGS AdamGS added the changelog/performance A performance improvement label May 15, 2026
@AdamGS AdamGS force-pushed the adamg/all-uncompressed-size-fns branch from f5e00a5 to 3821439 Compare May 15, 2026 16:37
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 15, 2026

Merging this PR will improve performance by 16.74%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 1 improved benchmark
✅ 1236 untouched benchmarks

Performance Changes

Mode Benchmark BASE HEAD Efficiency
WallTime cuda/bitpacked_u8/unpack/3bw[100M] 350.3 µs 300.1 µs +16.74%

Tip

Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.


Comparing adamg/all-uncompressed-size-fns (3241f4f) with develop (a88e6b1)

Open in CodSpeed

@tl-dr-review tl-dr-review Bot added the tldr label May 15, 2026
@gatesn gatesn added tldr and removed tldr labels May 15, 2026
@tl-dr-review tl-dr-review Bot added tldr and removed tldr labels May 15, 2026
@AdamGS
Copy link
Copy Markdown
Contributor Author

AdamGS commented May 19, 2026

@claude review this PR

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

Claude finished @AdamGS's task in 7m 26s —— View job


PR Review: Add UncompressedSizeInBytes pushdown for all encodings

Todo List

  • Read core implementation files
  • Review encoding implementations (alp, bytebool, fastlanes, pco, zigzag, etc.)
  • Review FSST and Zstd (complex compressed encodings)
  • Review vortex-array kernel and session changes
  • Investigate CodSpeed regression in encode_varbin
  • Post comprehensive review

This is a solid and well-scoped PR. The architecture is clean: a FixedWidthUncompressedSizeInBytesKernel shared across all fixed-width encodings, with bespoke kernels for FSST, Zstd, VarBin, and List. The vortex-file refactor toward initialize() functions is a nice improvement. A few issues worth addressing:


🔴 Code Duplication in FSST and Zstd Kernels

Both encodings/fsst/src/compute/uncompressed_size.rs and encodings/zstd/src/compute/uncompressed_size.rs define private copies of checked_len_mul and validity_uncompressed_size_in_bytes that are byte-for-byte identical to the versions in vortex-array:

// In fsst/compute/uncompressed_size.rs AND zstd/compute/uncompressed_size.rs:
fn checked_len_mul(len: usize, width: usize, name: &str) -> VortexResult<u64> { ... }
fn validity_uncompressed_size_in_bytes(validity: Mask) -> VortexResult<u64> { ... }

These are pub(crate) in vortex-array/src/aggregate_fn/fns/uncompressed_size_in_bytes/mod.rs, so they're inaccessible from downstream crates. Since they're part of the clearly intended extension point for encoding implementors, they should be made pub (or re-exported through a stable path) so encoding crates don't need to duplicate them.

Fix this →


🟡 Missing FSST Uncompressed Size Test

Zstd has test_uncompressed_size_does_not_decompress_primitive_frames which verifies the kernel doesn't decode data. The FSST kernel in encodings/fsst/src/compute/uncompressed_size.rs has no equivalent test. Given the complexity of the slice-aware uncompressed_lengths_size_typed computation, a test verifying correctness (and that it avoids decompression) would be valuable.


🟡 Zstd n_values == 0 Fallback Mixes Semantics

In selected_frame_uncompressed_size:

let frame_n_values = if frame_meta.n_values == 0 {
    frame_uncompressed_size  // bytes being used as a value count!
} else {
    usize::try_from(frame_meta.n_values).vortex_expect("frame size must fit usize")
};

When n_values == 0, frame_uncompressed_size (a byte count) is used as a value count. This mixes semantics in a non-obvious way. A comment explaining when n_values == 0 is valid and what the byte-as-value-count implies for slice arithmetic would help reviewers understand the intent. If this is used only for raw byte frames where the slice arithmetic doesn't apply (e.g., primitive arrays where the FixedWidthUncompressedSizeInBytesKernel is preferred), a debug assertion or explicit comment would make that clearer.


🟡 ListUncompressedSizeInBytesKernel Computes VarBinView-Sized Offsets and Sizes

The List kernel computes:

let offsets_size = checked_len_mul(array.len(), size_of::<u64>(), "list offsets")?;
let sizes_size = checked_len_mul(array.len(), size_of::<u64>(), "list sizes")?;

This hardcodes u64 for both buffers. It matches the canonical ListViewArray layout (list_view_uncompressed_size_in_bytes also uses len * 8 per buffer). However, the underlying ListArray typically uses u32 or u64 offsets — this should remain consistent if the canonical form is ever updated. A comment noting the canonical form this targets would help.


🟡 Performance Regression in encode_varbin[(1000, 2)]

CodSpeed reports a 12.09% regression (148.5 µs → 168.9 µs), but flags it explicitly: "Different runtime environments detected." Looking at the benchmark:

fn encode_varbin(bencher: Bencher, (len, unique_values): (usize, usize)) {
    let varbin_arr = VarBinArray::from(gen_varbin_words(len, unique_values));
    bencher.bench_refs(|arr| dict_encode(&arr.clone().into_array()));
}

dict_encodedict_encode_with_constraintsbytes_dict_builder.encode() — none of these code paths call UncompressedSizeInBytes. The VarBinUncompressedSizeInBytesKernel registration in AggregateFnSession::default() doesn't affect this path. The regression is very likely environmental noise.

That said, it's worth acknowledging on CodSpeed and keeping an eye on it after re-running.


✅ Things that look good

  • The FixedWidthUncompressedSizeInBytesKernel handles Null/Bool/Primitive/Decimal/Extension dtypes and returns None for variable-width types — the correct fallback behaviour.
  • try_accumulate is used as the fast path; the accumulator falls back to decode only when the kernel returns None.
  • Statistics caching in uncompressed_size_in_bytes_u64 is correct and avoids re-computation.
  • DateTimeParts, DecimalByteParts, and Sequence all decode to fixed-width canonical forms, so FixedWidthUncompressedSizeInBytesKernel is semantically correct for each.
  • ByteBool → canonical BoolArray is bit-packed (ceil(len/8) bytes), and FixedWidthUncompressedSizeInBytesKernel with Bool dtype computes exactly that. ✓
  • Sparse → canonical dense array of len elements at the logical width. FixedWidthUncompressedSizeInBytesKernel gives len * width. ✓
  • The vortex-file refactor cleanly moves ad-hoc arrays.register() calls into initialize() so that each encoding crate owns its full registration surface.
  • Test coverage in mod.rs is broad: null, bool, primitive (nullable and all-invalid), decimal, VarBinView, VarBin (sliced), List, FixedSizeList, struct, extension, constant, chunked, and cache-hit cases.

@tl-dr-review tl-dr-review Bot added tldr and removed tldr labels May 19, 2026
AdamGS added 4 commits May 20, 2026 12:04
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS force-pushed the adamg/all-uncompressed-size-fns branch from b800eb0 to ea71fd0 Compare May 20, 2026 11:43
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS changed the title Add UncompressedSizeInBytes pushdown for all encodings Add UncompressedSizeInBytes pushdown for many encodings May 20, 2026
@AdamGS AdamGS requested review from gatesn, joseph-isaacs and onursatici and removed request for joseph-isaacs May 20, 2026 15:24
@AdamGS AdamGS marked this pull request as ready for review May 20, 2026 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/performance A performance improvement tldr

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants