fix(fsst): pick i32 vs i64 codes offsets per call#7836
Closed
connortsui20 wants to merge 5 commits into
Closed
Conversation
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #7833.
fsst_compress_iterpreviously hardcodedVarBinBuilder::<i32>for theFSST output, panicking once cumulative compressed bytes crossed
i32::MAX.Approach
fsst_compressnow does an upfront pass over the input to compute thetotal uncompressed byte count, then picks
VarBinBuilder<i32>when theFSST upper-bound compressed size (
2 * total + 7 * n) fits ini32::MAX, falling back toVarBinBuilder<i64>only when it wouldactually overflow. Common case keeps the compact i32 layout; pathological
inputs are correctly handled.
fsst_compress_iter(single-pass iterator API) keeps its publicsignature and now always uses i64. It can't size-estimate without
consuming the iterator, and direct callers are test-only.
vortex-array/src/arrays/varbin/compute/compare.rs: latent bug surfacedby the i64 path. With i64 offsets,
Datum::try_newproduces an ArrowLargeBinary/LargeUtf8, but the RHS scalar was hardcoded toBinary/Utf8. Arrow refusesLargeBinary == Binary. The RHS nowpicks the matching Arrow scalar from
lhs.data_type().Tests
The previous ~5 GiB
#[ignore]d regression test is replaced with:upper_bound_fits_i32helper.i32codes_offsetsfor typical inputs.The i64 path is exercised only for inputs whose worst-case compressed
size exceeds
i32::MAX, which is too expensive to test directly; theboundary unit tests cover the dispatch.
Checks
cargo test -p vortex-fsst --lib— 83 passed (was 78 before this PR).cargo test -p vortex-array --lib arrays::varbin— 93 passed.cargo test -p vortex-btrblocks --lib— 35 passed.cargo +nightly fmt -- --check,cargo clippy -p vortex-fsst -p vortex-array --all-targets --all-features— clean../scripts/public-api.sh— no public-api.lock changes.🤖 Generated with Claude Code