feat(encoding): Subint Split Encoding by David-C-L · Pull Request #697 · facebookincubator/nimble

David-C-L · 2026-05-05T23:12:11Z

Summary

Adds SubIntSplitEncoding, a new encoding for 32- and 64-bit integer streams (including float/double via their physical bit patterns) that decomposes each value into bit-range sub-streams and encodes each independently.

The key insight is that structured integers (e.g. timestamps, IDs, packed bitfields) often compress differently in different bit regions. For example, a 64-bit Snowflake ID with a fixed 41-bit timestamp, an 11-bit datacenter field, and a 12-bit monotone counter can encode each section differently (e.g. dictionary on the datacenter region, delta on the monotone counter region). Crucially, splitting at the exact, known field boundaries can be insufficient: depending on the data stream, a timestamp's upper bits may vary very little and benefit from a prefix encoding, while its lower bits may vary a lot but be roughly sorted and benefit from a FOR (frame-of-reference) encoding.
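To make the decomposition concrete, here is a minimal sketch (not the PR's implementation) that splits 64-bit values into bit-range sub-streams and recombines them. `BitRange`, `extractSection`, `splitStreams`, and `combineStreams` are hypothetical names, ranges are assumed to be inclusive with bit 0 as the least significant bit, and sections are kept as `uint64_t` for simplicity rather than narrowed to the smallest type that fits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type for illustration; not part of the PR.
// A section covers bits [lo, hi] inclusive, bit 0 being the least significant.
struct BitRange {
  int lo;
  int hi;
};

// Extract bits [lo, hi] of value, shifted down so the section starts at bit 0.
inline uint64_t extractSection(uint64_t value, BitRange r) {
  assert(r.lo >= 0 && r.lo <= r.hi && r.hi < 64);
  const int width = r.hi - r.lo + 1;
  // Shifting a 64-bit value by 64 is undefined, so handle the full width apart.
  const uint64_t mask =
      width == 64 ? ~uint64_t{0} : ((uint64_t{1} << width) - 1);
  return (value >> r.lo) & mask;
}

// Split every value into one sub-stream per section.
inline std::vector<std::vector<uint64_t>> splitStreams(
    const std::vector<uint64_t>& values,
    const std::vector<BitRange>& ranges) {
  std::vector<std::vector<uint64_t>> streams(ranges.size());
  for (std::size_t s = 0; s < ranges.size(); ++s) {
    streams[s].reserve(values.size());
    for (uint64_t v : values) {
      streams[s].push_back(extractSection(v, ranges[s]));
    }
  }
  return streams;
}

// Recombine by shifting each section back into place and OR-ing.
inline std::vector<uint64_t> combineStreams(
    const std::vector<std::vector<uint64_t>>& streams,
    const std::vector<BitRange>& ranges,
    std::size_t rowCount) {
  assert(streams.size() == ranges.size());
  std::vector<uint64_t> out(rowCount, 0);
  for (std::size_t s = 0; s < ranges.size(); ++s) {
    for (std::size_t i = 0; i < rowCount; ++i) {
      out[i] |= streams[s][i] << ranges[s].lo;
    }
  }
  return out;
}
```

Each sub-stream could then be handed to a different nested encoding, which is where the per-region compressibility argument above pays off.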

Core components:

Sampler (SubIntSplitSampler.h): block-stratified sampling that preserves local temporal structure needed for run-length estimation.
Metrics + Cost Models (SubIntSplitMetrics.h, SubIntSplitCostModels.h): lightweight per-segment statistics (min/max, unique count capped at 16K, run stats) and cost estimators for all relevant nimble encodings, evaluated in the DP inner loop.
Selector (SubIntSplitSelector.h): DP over bit positions 0..N evaluating a full N×N grid of bit-range costs (with incremental bit-extension to amortize extraction) and a configurable per-boundary penalty to avoid over-splitting.
Config (SubIntSplitConfig.h): boundary serialization ("0-7;8-15;16-31") and the preserve/recompute mode config keys used to replay a captured layout across stripes.
Encoding (SubIntSplitEncoding.h): encode splits values into per-section buffers using the narrowest unsigned type that fits each section's width, calls encodeNested for each, and writes a compact binary header. Decoder OR-combines sections back into the output.
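The boundary string format mentioned under Config can be illustrated with a small round-trip sketch. It assumes each `lo-hi` pair is an inclusive bit range and that pairs are separated by `;`, as the "0-7;8-15;16-31" example suggests. `Boundaries`, `parseBoundaries`, and `serializeBoundaries` are hypothetical names, not the functions in SubIntSplitConfig.h.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation: one inclusive (lo, hi) bit range per section.
using Boundaries = std::vector<std::pair<int, int>>;

// Render ranges as "lo-hi;lo-hi;...".
inline std::string serializeBoundaries(const Boundaries& b) {
  std::string out;
  for (std::size_t i = 0; i < b.size(); ++i) {
    assert(b[i].first <= b[i].second);
    if (i > 0) {
      out += ';';
    }
    out += std::to_string(b[i].first) + "-" + std::to_string(b[i].second);
  }
  return out;
}

// Parse "lo-hi;lo-hi;..." back into ranges, rejecting malformed items.
inline Boundaries parseBoundaries(const std::string& text) {
  Boundaries result;
  std::istringstream in(text);
  std::string item;
  while (std::getline(in, item, ';')) {
    const std::size_t dash = item.find('-');
    if (dash == std::string::npos) {
      throw std::invalid_argument("expected lo-hi, got: " + item);
    }
    // std::stoi itself throws std::invalid_argument on non-numeric input.
    const int lo = std::stoi(item.substr(0, dash));
    const int hi = std::stoi(item.substr(dash + 1));
    if (lo > hi) {
      throw std::invalid_argument("range is reversed: " + item);
    }
    result.emplace_back(lo, hi);
  }
  return result;
}
```

A text form like this is easy to store as a config value and replay on later stripes, which appears to be the purpose of the preserve mode described above.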

Integration:

Registered in EncodingFactory, EncodingSelectionPolicy (read factor 0.85), EncodingSizeEstimation (heuristic via FixedBitWidth proxy with a range-width filter), EncodingUtils, EncodingLayout capture/serialize/deserialize, and EncodingUtilities traversal.
NimbleDumpLib no longer depends on test-only TestUtils.h; getRawDataSize is now self-contained in the tools library.

Compilation fixes (required to build the new code under gcc/clang):

Moved EncodingTypeTraits from inner class bodies to namespace scope (explicit specializations inside class templates are ill-formed).
Replaced explicit member function specializations with if constexpr in several encoding tests.
Fixed designated initializer ordering in VeloxWriterTest and initializer-list ambiguity in selective reader tests.
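As a hedged illustration of the first two fixes, the sketch below declares a trait's explicit specializations at namespace scope and uses `if constexpr` inside a class template where a per-type member specialization might otherwise be written. `ExampleTypeTraits` and `ExampleEncodingTest` are hypothetical stand-ins, not the actual nimble test types, and the code assumes C++17.

```cpp
#include <cstdint>
#include <string_view>
#include <type_traits>

// Primary template declared at namespace scope (hypothetical trait).
template <typename T>
struct ExampleTypeTraits;

// Explicit specializations also live at namespace scope, not in a class body.
template <>
struct ExampleTypeTraits<float> {
  using physicalType = uint32_t;
  static constexpr std::string_view kName = "float";
};

template <>
struct ExampleTypeTraits<double> {
  using physicalType = uint64_t;
  static constexpr std::string_view kName = "double";
};

template <typename T>
struct ExampleEncodingTest {
  // One member with a compile-time branch replaces a set of explicit
  // member specializations, one per T.
  static constexpr int physicalBits() {
    using Physical = typename ExampleTypeTraits<T>::physicalType;
    if constexpr (std::is_same_v<Physical, uint32_t>) {
      return 32;
    } else {
      return 64;
    }
  }
};
```

Only the selected branch is instantiated for a given `T`, so type-specific code in the other branch does not need to compile for that type.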

Prior benchmarks

TO BE ADDED SOON

Coming updates

FOR encoding
Frequency Partition encoding

meta-codesync · 2026-05-06T03:42:30Z

@srsuryadev has imported this pull request. If you are a Meta employee, you can view this in D103975634.

srsuryadev · 2026-05-06T14:20:59Z

+
+struct SelectorConfig {
+  int minSegmentWidth{1};
+  double splitPenalty{10.0}; // extra bits charged per additional split boundary


@David-C-L Correct me if I'm wrong, but for Subint Encoding the split information can be stored in a header/metadata.

So we may not need to add the extra bits as a penalty during cost computation, since this will be a constant overhead. Or do you think differently here? Thanks

David-C-L · 2026-05-12

Yeah, you're right, the split information can certainly be considered a constant in the header/metadata. This penalty can be removed and really should be superseded by a cost model that accounts for expected decode times (bulk and single), since in my preliminary experiments more splits degrade decode time. However, it remains to be seen whether we can make the overhead of stitching many splits together negligible.
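For readers following this thread, here is a minimal sketch of how a per-boundary penalty enters a DP over bit positions. It is an illustration under stated assumptions rather than the code in SubIntSplitSelector.h: `SelectorConfigSketch` mirrors the two fields quoted above, `RangeCost` and `selectSplits` are hypothetical, sections are half-open `[lo, hi)` ranges, and the cost of a range is supplied by the caller instead of being estimated from sampled data.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <limits>
#include <utility>
#include <vector>

// Mirrors the fields quoted in the review; the struct name is hypothetical.
struct SelectorConfigSketch {
  int minSegmentWidth{1};
  double splitPenalty{10.0};
};

// cost(lo, hi): estimated cost of encoding bits [lo, hi) as one section.
using RangeCost = std::function<double(int, int)>;

// Returns sections as half-open [lo, hi) pairs that cover [0, totalBits).
inline std::vector<std::pair<int, int>> selectSplits(
    int totalBits,
    const RangeCost& cost,
    const SelectorConfigSketch& cfg) {
  assert(cfg.minSegmentWidth >= 1 && totalBits >= cfg.minSegmentWidth);
  const double inf = std::numeric_limits<double>::infinity();
  // best[hi] = cheapest plan covering bits [0, hi); prev[hi] = start of the
  // last section in that plan.
  std::vector<double> best(totalBits + 1, inf);
  std::vector<int> prev(totalBits + 1, -1);
  best[0] = 0.0;
  for (int hi = cfg.minSegmentWidth; hi <= totalBits; ++hi) {
    for (int lo = 0; lo + cfg.minSegmentWidth <= hi; ++lo) {
      if (best[lo] == inf) {
        continue;
      }
      // The first section is free; every additional boundary pays a penalty.
      const double penalty = lo == 0 ? 0.0 : cfg.splitPenalty;
      const double candidate = best[lo] + cost(lo, hi) + penalty;
      if (candidate < best[hi]) {
        best[hi] = candidate;
        prev[hi] = lo;
      }
    }
  }
  assert(prev[totalBits] >= 0);  // requires at least one finite-cost plan
  std::vector<std::pair<int, int>> sections;
  for (int hi = totalBits; hi > 0; hi = prev[hi]) {
    sections.emplace_back(prev[hi], hi);
  }
  std::reverse(sections.begin(), sections.end());
  return sections;
}
```

With a cost function that makes any section straddling bit 8 expensive, a penalty of 10 yields two sections split at bit 8, while a very large penalty collapses the plan back to a single section. Replacing the constant penalty with a decode-time term, as suggested in the reply, would change only the `penalty` line.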

srsuryadev · 2026-05-13T03:52:46Z

@David-C-L Similarly, let's add some e2e tests so we can merge the initial functional version. Thank you! https://github.com/facebookincubator/nimble/blob/main/dwio/nimble/velox/selective/tests/E2EFilterTest.cpp

Let's add fuzzer tests as well for more robustness, so we can sweep an exhaustive range of inputs: https://github.com/facebookincubator/nimble/blob/main/dwio/nimble/fuzzer/encoding/EncodingFuzzer.h

srsuryadev · 2026-05-13T06:37:17Z

+void SubIntSplitEncoding<T>::materialize(uint32_t rowCount, void* buffer) {
+  physicalType* output = static_cast<physicalType*>(buffer);
+  std::fill(output, output + rowCount, physicalType{0});
+


@David-C-L As a further optimization, we can pre-allocate a buffer small enough to fit in the L1 cache and iterate over it block by block to avoid L1/L2 cache thrashing.

David-C-L · 2026-05-19

This chunking is now implemented, along with some additional optimisations (e.g. AVX2 and prefetching) from my local implementation.
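A rough sketch of the chunked approach described in this reply, assuming each section has already been decoded into a 64-bit buffer. The PR's commit message mentions a persistent scratch buffer, AVX2 paths, and prefetch hints, none of which are shown here. `SectionSketch` and `materializeChunked` are hypothetical names, and the default chunk size is an assumption, not a tuned value. The first section is assigned rather than OR-ed, so the output needs no upfront `std::fill`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a decoded section: values already widened to
// 64 bits, plus the bit position where the section starts.
struct SectionSketch {
  std::vector<uint64_t> values;
  int shift;
};

// Stitch sections into output one chunk of rows at a time, so the output
// chunk stays cache-resident while every section is applied to it.
inline void materializeChunked(
    const std::vector<SectionSketch>& sections,
    uint32_t rowCount,
    uint64_t* output,
    uint32_t chunkRows = 1024) {
  assert(!sections.empty() && chunkRows > 0);
  for (const SectionSketch& s : sections) {
    assert(s.values.size() >= rowCount);
  }
  for (uint32_t begin = 0; begin < rowCount;) {
    const uint32_t n = std::min(chunkRows, rowCount - begin);
    const uint32_t end = begin + n;
    // First section: plain assignment overwrites whatever was in output.
    const SectionSketch& first = sections[0];
    for (uint32_t i = begin; i < end; ++i) {
      output[i] = first.values[i] << first.shift;
    }
    // Remaining sections: OR into the chunk that was just written.
    for (std::size_t s = 1; s < sections.size(); ++s) {
      const SectionSketch& section = sections[s];
      for (uint32_t i = begin; i < end; ++i) {
        output[i] |= section.values[i] << section.shift;
      }
    }
    begin = end;
  }
}
```

At 1024 rows of 8 bytes each, an output chunk is 8 KiB, which would fit in a typical L1 data cache, though the right size depends on the target hardware.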

srsuryadev

LGTM for the initial version; later we can tune it and add further optimizations to improve decode performance and storage efficiency.

We will merge this after the long weekend.

Thank you very much for the contribution, @David-C-L!

David-C-L added 18 commits May 5, 2026 23:44

[build] Fix FLASE typo in CMakeLists.txt flatbuffers variable

ea12dcb

[common/types] Add SubIntSplit = 12 to EncodingType enum and toString

c12ebb5

[encodings/subintsplit] Add section identifier stub for SubIntSplitEncoding

56b0411

[encodings/subintsplit] Add sampler: block-stratified sampling for the DP planner

8c62f86

[encodings/subintsplit] Add segment metrics and cost models for the DP selector

98043f5

[encodings/subintsplit] Add DP-based bit-range split selector

2f282ac

[encodings/subintsplit] Add split-boundary config serialization (SubIntSplitConfig.h)

4b62584

[encodings/subintsplit] Implement SubIntSplitEncoding: encode, decode, skip, materialize

e28ea9a

[encodings/layout] Add Config::values() accessor and serialize/deserialize config as extra data

06d9592

[encodings/factory] Register SubIntSplit in EncodingFactory create and encode

15de29c

[encodings/policy] Add SubIntSplit to selection policy, size estimation, and dispatch

92302f9

[tools] Add SubIntSplit traversal in EncodingUtilities; internalize getRawDataSize in NimbleDumpLib

44b5745

[tests/fix] Move EncodingTypeTraits to namespace scope to fix clang/gcc-14 inner-class specialization

e76ea63

[tests/fix] Replace explicit member specializations with if constexpr in encoding tests

42a6397

[tests/fix] Fix designated initializer field ordering in VeloxWriterTest

8faae82

[tests/fix] Add NullableArrayData/MapData helpers for initializer-list ambiguity in selective reader tests

32cfb85

[benchmarks/fix] Replace TestUtils::Encoder with direct FixedBitWidthEncoding call

bd2716c

[tests] Add SubIntSplitEncoding tests: round-trip, preserve mode, layout serialization, and replay

3963438

meta-cla Bot added the CLA Signed label (managed by the Meta Open Source bot) May 5, 2026

srsuryadev requested review from srsuryadev and zzhao0 May 6, 2026 13:44

srsuryadev reviewed May 6, 2026

srsuryadev requested a review from pedroerp May 6, 2026 16:54

srsuryadev reviewed May 13, 2026

David-C-L added 4 commits May 18, 2026 21:52

[encodings/legacy] add SubIntSplit to legacy factories and dispatch for use in E2E filter tests

598e0b5

[velox/selective/tests/E2E] add SubIntSplit E2E tests for int32/64 float32/64

b838805

[encodings/tests] add SubInt to TestUtils dispatch

7f0f27c

[fuzzer/encodings] add bit structured fuzzer data generator for subintsplit

92b8264

David-C-L added 2 commits May 18, 2026 23:46

[fuzzer/encoding/test] add SubIntSplit fuzzer test

2e2a779

[encodings/SubIntSplit] implement a chunking loop for materialize to reduce L1/L2 thrashing, add a persistent scratch buffer to avoid reallocation, include AVX2 paths and prefetch hints, and use first-section assignment instead of OR to avoid upfront std::fill cost

cf0d900

srsuryadev changed the title ~~Subint Split Encoding~~ feat(encoding): Subint Split Encoding May 20, 2026

srsuryadev approved these changes May 21, 2026

David-C-L wants to merge 24 commits into facebookincubator:main from David-C-L:subint_split_synced
