feat(encoding): Subint Split Encoding#697

Open
David-C-L wants to merge 24 commits into facebookincubator:main from David-C-L:subint_split_synced
Conversation

@David-C-L David-C-L commented May 5, 2026

Summary

Adds SubIntSplitEncoding, a new encoding for 32- and 64-bit integer streams (including float/double via their physical bit patterns) that decomposes each value into bit-range sub-streams and encodes each independently.

The key insight is that structured integers (e.g. timestamps, IDs, packed bitfields) often compress differently in different bit regions. For example, a 64-bit Snowflake ID with a fixed 41-bit timestamp, an 11-bit datacenter field, and a 12-bit monotone counter can encode each section differently (e.g. dictionary on the datacenter region; delta on the monotone counter region). Crucially, splitting at the exact, known boundaries can be insufficient: depending on the data stream, a timestamp's upper bits can have very little variance and may benefit from a prefix encoding, while the lower bits may vary a lot but be roughly sorted, and so benefit from a FOR encoding.
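
To make the decomposition concrete, here is a minimal sketch of splitting values into bit-range sub-streams and OR-combining them back. It is not the code in this PR: `BitRange`, `extractBits`, `splitStreams`, and `combineStreams` are hypothetical names, and the field order used in the usage note (counter in the low bits, timestamp in the high bits) is an assumption about the Snowflake example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Inclusive bit range [lo, hi] within a 64-bit value (hypothetical helper type).
struct BitRange {
  int lo;
  int hi;
};

// Extracts bits [lo, hi] of v and shifts them down to bit 0.
inline uint64_t extractBits(uint64_t v, BitRange r) {
  assert(r.lo >= 0 && r.hi < 64 && r.lo <= r.hi);
  const int width = r.hi - r.lo + 1;
  // Shifting a 64-bit value by 64 is undefined, so handle full width separately.
  const uint64_t mask =
      width == 64 ? ~uint64_t{0} : ((uint64_t{1} << width) - 1);
  return (v >> r.lo) & mask;
}

// Produces one sub-stream per range; each sub-stream could then be handed
// to a different nested encoding.
inline std::vector<std::vector<uint64_t>> splitStreams(
    const std::vector<uint64_t>& values,
    const std::vector<BitRange>& ranges) {
  std::vector<std::vector<uint64_t>> out(ranges.size());
  for (std::size_t s = 0; s < ranges.size(); ++s) {
    out[s].reserve(values.size());
    for (uint64_t v : values) {
      out[s].push_back(extractBits(v, ranges[s]));
    }
  }
  return out;
}

// Rebuilds the original values by shifting each section back into place
// and OR-ing it into the output.
inline std::vector<uint64_t> combineStreams(
    const std::vector<std::vector<uint64_t>>& sections,
    const std::vector<BitRange>& ranges) {
  assert(sections.size() == ranges.size());
  const std::size_t n = sections.empty() ? 0 : sections[0].size();
  std::vector<uint64_t> out(n, 0);
  for (std::size_t s = 0; s < ranges.size(); ++s) {
    for (std::size_t i = 0; i < n; ++i) {
      out[i] |= sections[s][i] << ranges[s].lo;
    }
  }
  return out;
}
```

Under the assumed layout, the ranges `{0, 11}`, `{12, 22}`, and `{23, 63}` would yield a counter stream, a datacenter stream, and a timestamp stream. The round trip is lossless as long as the ranges are disjoint and cover every bit.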

Core components:

  • Sampler (SubIntSplitSampler.h): block-stratified sampling that preserves local temporal structure needed for run-length estimation.
  • Metrics + Cost Models (SubIntSplitMetrics.h, SubIntSplitCostModels.h): lightweight per-segment statistics (min/max, unique count capped at 16K, run stats) and cost estimators for all relevant nimble encodings, used as the DP inner loop.
  • Selector (SubIntSplitSelector.h): DP over bit positions 0..N evaluating a full N×N grid of bit-range costs (with incremental bit-extension to amortize extraction) and a configurable per-boundary penalty to avoid over-splitting.
  • Config (SubIntSplitConfig.h): boundary serialization ("0-7;8-15;16-31") and the preserve/recompute mode config keys used to replay a captured layout across stripes.
  • Encoding (SubIntSplitEncoding.h): encode splits values into per-section buffers using the narrowest unsigned type that fits each section's width, calls encodeNested for each, and writes a compact binary header. Decoder OR-combines sections back into the output.
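
For reviewers who want a mental model of the selector, the following is a simplified sketch of a DP over bit positions with a per-boundary penalty, plus serialization to the `"0-7;8-15;16-31"` form. It is an illustration under assumptions, not the implementation in SubIntSplitSelector.h: `RangeCost`, `chooseSplits`, and `serializeBoundaries` are hypothetical names, the cost callback stands in for the sampled metrics and cost models, and the incremental bit-extension and minimum segment width are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// cost(lo, hi): estimated encoded size of the inclusive bit range [lo, hi].
// Hypothetical stand-in for the per-segment cost models.
using RangeCost = std::function<double(int, int)>;

// Chooses inclusive ranges covering bits 0..bitWidth-1 that minimize
// the summed range cost plus splitPenalty per extra boundary.
inline std::vector<std::pair<int, int>> chooseSplits(
    int bitWidth, const RangeCost& cost, double splitPenalty) {
  assert(bitWidth > 0);
  const double inf = std::numeric_limits<double>::infinity();
  // best[j]: minimal cost to cover bits [0, j).
  // prev[j]: start bit of the last segment in that optimal cover.
  std::vector<double> best(bitWidth + 1, inf);
  std::vector<int> prev(bitWidth + 1, 0);
  best[0] = 0.0;
  for (int j = 1; j <= bitWidth; ++j) {
    for (int i = 0; i < j; ++i) {
      // Every segment after the first introduces one boundary.
      const double c =
          best[i] + cost(i, j - 1) + (i > 0 ? splitPenalty : 0.0);
      if (c < best[j]) {
        best[j] = c;
        prev[j] = i;
      }
    }
  }
  // Walk back from the top bit to recover the chosen segments.
  std::vector<std::pair<int, int>> ranges;
  for (int j = bitWidth; j > 0; j = prev[j]) {
    ranges.emplace_back(prev[j], j - 1);
  }
  std::reverse(ranges.begin(), ranges.end());
  return ranges;
}

// Serializes ranges as "lo-hi" pairs joined by ';'.
inline std::string serializeBoundaries(
    const std::vector<std::pair<int, int>>& ranges) {
  std::string out;
  for (std::size_t k = 0; k < ranges.size(); ++k) {
    if (k > 0) {
      out += ';';
    }
    out += std::to_string(ranges[k].first) + '-' +
        std::to_string(ranges[k].second);
  }
  return out;
}
```

This sketch calls the cost function once per (start, end) pair, which matches the N×N grid described above. A larger `splitPenalty` pushes the result toward fewer, wider segments.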

Integration:

  • Registered in EncodingFactory, EncodingSelectionPolicy (read factor 0.85), EncodingSizeEstimation (heuristic via FixedBitWidth proxy with a range-width filter), EncodingUtils, EncodingLayout capture/serialize/deserialize, and EncodingUtilities traversal.
  • NimbleDumpLib no longer depends on test-only TestUtils.h; getRawDataSize is now self-contained in the tools library.

Compilation fixes (required to build the new code under gcc/clang):

  • Moved EncodingTypeTraits from inner class bodies to namespace scope (explicit specializations inside class templates are ill-formed).
  • Replaced explicit member function specializations with if constexpr in several encoding tests.
  • Fixed designated initializer ordering in VeloxWriterTest and initializer-list ambiguity in selective reader tests.

Prior benchmarks

TO BE ADDED SOON

Coming updates

  • FOR encoding
  • Frequency Partition encoding

David-C-L added 18 commits May 5, 2026 23:44
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 5, 2026
meta-codesync Bot commented May 6, 2026

@srsuryadev has imported this pull request. If you are a Meta employee, you can view this in D103975634.

@srsuryadev srsuryadev requested review from srsuryadev and zzhao0 May 6, 2026 13:44

struct SelectorConfig {
int minSegmentWidth{1};
double splitPenalty{10.0}; // extra bits charged per additional split boundary
Contributor


@David-C-L Correct me if I'm wrong, but for Subint encoding the split information can be stored in a header/metadata.

So we may not need to charge the extra bits as a penalty during cost computation, since this is a constant overhead. Or do you see it differently? Thanks

Author


Yeah, you're right, the split information can certainly be considered a constant in the header/metadata. This penalty can be removed and really should be superseded by a cost model that accounts for expected decode times (bulk and single), since in my preliminary experiments more splits degrade decode time. However, it remains to be seen whether we can make the overhead of stitching many splits together negligible.

@srsuryadev srsuryadev requested a review from pedroerp May 6, 2026 16:54
Contributor

srsuryadev commented May 13, 2026

@David-C-L Similarly, let's add some e2e tests so we can merge the initial functional version, thank you! https://github.com/facebookincubator/nimble/blob/main/dwio/nimble/velox/selective/tests/E2EFilterTest.cpp

Let's add fuzzer tests as well for more robustness, so we can sweep a much wider range of inputs: https://github.com/facebookincubator/nimble/blob/main/dwio/nimble/fuzzer/encoding/EncodingFuzzer.h

void SubIntSplitEncoding<T>::materialize(uint32_t rowCount, void* buffer) {
physicalType* output = static_cast<physicalType*>(buffer);
std::fill(output, output + rowCount, physicalType{0});

Contributor


@David-C-L As a further optimization, we can pre-allocate a buffer small enough to fit in the L1 cache and iterate over it block by block to avoid L1/L2 cache thrashing.

Author


This chunking operation is now implemented along with some additional optimisations (e.g. AVX2 and prefetching) from my local implementation

David-C-L added 2 commits May 18, 2026 23:46
…reduce L1/L2 thrashing, add a persistent scratch buffer to avoid reallocation, include AVX2 paths and prefetch hints, and use first-section assignment instead of OR to avoid upfront std::fill cost
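
A scalar sketch of the blocked materialize idea discussed above, for readers following the thread. It is not the code from these commits: `Section` and `materializeBlocked` are hypothetical names, the sections are assumed to be already decoded into vectors (the real decoder would presumably fill a scratch buffer per block), and the AVX2 and prefetch paths are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One decoded section: values aligned to bit 0, plus the shift that moves
// them back to their original bit position. Hypothetical type.
struct Section {
  std::vector<uint64_t> values;
  int shift;
};

// Rebuilds rowCount values block by block so that each output block stays
// cache-resident while all sections are merged into it.
inline void materializeBlocked(
    const std::vector<Section>& sections,
    uint64_t* output,
    std::size_t rowCount,
    std::size_t blockRows = 1024) {
  assert(!sections.empty() && blockRows > 0);
  for (std::size_t begin = 0; begin < rowCount; begin += blockRows) {
    const std::size_t end = std::min(begin + blockRows, rowCount);
    // The first section assigns, so the output needs no upfront zero fill.
    for (std::size_t i = begin; i < end; ++i) {
      output[i] = sections[0].values[i] << sections[0].shift;
    }
    // Remaining sections OR into the block that was just written.
    for (std::size_t s = 1; s < sections.size(); ++s) {
      for (std::size_t i = begin; i < end; ++i) {
        output[i] |= sections[s].values[i] << sections[s].shift;
      }
    }
  }
}
```

The default of 1024 rows per block is an arbitrary placeholder; a suitable value depends on the element width and the target cache size.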
@srsuryadev srsuryadev changed the title Subint Split Encoding feat(encoding): Subint Split Encoding May 20, 2026
Contributor

@srsuryadev srsuryadev left a comment


LGTM for the initial version, later we can tune and add further optimization tricks to improve decode performance and storage efficiency.

We will merge this after the long weekend.

Thank you very much for the contribution, @David-C-L!


Labels

CLA Signed This label is managed by the Meta Open Source bot.


2 participants