feat(encoding): Subint Split Encoding #697
…, skip, materialize
…alize config as extra data
…etRawDataSize in NimbleDumpLib
…cc-14 inner-class specialization
… in encoding tests
…t ambiguity in selective reader tests
…out serialization, and replay
@srsuryadev has imported this pull request. If you are a Meta employee, you can view this in D103975634.
struct SelectorConfig {
  int minSegmentWidth{1};
  double splitPenalty{10.0}; // extra bits charged per additional split boundary
@David-C-L You can correct me here: for Subint Encoding, the split information can be stored in a header/metadata.
Hence we may not need to add the extra bit information as a penalty during cost computation, since this will be a constant overhead. Or do you think differently here? Thanks
Yeah, you're right, the split information can certainly be considered a constant in the header/metadata. This penalty can be removed and really should be superseded by a cost model that accounts for expected decode times (bulk and single), since more splits, in my preliminary experiments, show a degradation in decode time. However, it remains to be seen whether we can make the overhead of stitching many splits together negligible.
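For readers following the thread, here is a minimal sketch of how a per-boundary penalty like `splitPenalty` could enter a size-based cost estimate. The `estimateCost` function and its `segmentBits` input are hypothetical and are not taken from the PR; only the two `SelectorConfig` fields come from the diff excerpt above.

```cpp
#include <cstddef>
#include <vector>

// Mirrors the two fields visible in the diff excerpt; the rest is illustrative.
struct SelectorConfig {
  int minSegmentWidth{1};
  double splitPenalty{10.0}; // extra bits charged per additional split boundary
};

// Hypothetical helper: segmentBits holds an estimated encoded size (in bits)
// for each bit-range segment of a candidate split.
double estimateCost(
    const std::vector<double>& segmentBits,
    const SelectorConfig& config) {
  if (segmentBits.empty()) {
    return 0.0;
  }
  double total = 0.0;
  for (double bits : segmentBits) {
    total += bits;
  }
  // A split into k segments has k - 1 boundaries.
  const double boundaries = static_cast<double>(segmentBits.size() - 1);
  return total + config.splitPenalty * boundaries;
}
```

Setting `splitPenalty` to 0 reduces this to the plain sum of segment sizes, which matches treating the split layout as constant header overhead; a decode-time-aware model, as suggested above, would replace the penalty term rather than tune it.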
@David-C-L Similarly, let's add some e2e tests so we can merge the initial functional version, thank you! https://github.com/facebookincubator/nimble/blob/main/dwio/nimble/velox/selective/tests/E2EFilterTest.cpp Let's also add fuzzer tests for more robustness, so we can sweep an exhaustive range of inputs: https://github.com/facebookincubator/nimble/blob/main/dwio/nimble/fuzzer/encoding/EncodingFuzzer.h
void SubIntSplitEncoding<T>::materialize(uint32_t rowCount, void* buffer) {
  physicalType* output = static_cast<physicalType*>(buffer);
  std::fill(output, output + rowCount, physicalType{0});
@David-C-L For further optimization, we can pre-allocate a buffer that can potentially fit in the L1 cache and iterate over it in blocks to avoid L1/L2 cache thrashing.
This chunking operation is now implemented along with some additional optimisations (e.g. AVX2 and prefetching) from my local implementation
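A minimal scalar sketch of the chunked approach discussed in this thread, assuming 64-bit outputs and sections that are already available as plain vectors. The names `Section`, `kBlockRows`, and `materializeBlocked` are hypothetical, the `std::copy` stands in for decoding a nested encoding, and the AVX2 and prefetch paths are omitted; this is not the PR's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one decoded bit-range sub-stream.
struct Section {
  std::vector<uint64_t> values; // one entry per row, already right-aligned
  uint32_t shift;               // bit offset of this section in the output
};

// Assumption: 1024 rows * 8 bytes = 8 KiB, small enough for a typical L1 cache.
constexpr std::size_t kBlockRows = 1024;

void materializeBlocked(
    const std::vector<Section>& sections,
    std::size_t rowCount,
    uint64_t* output) {
  if (sections.empty()) {
    std::fill(output, output + rowCount, uint64_t{0});
    return;
  }
  // Scratch buffer allocated once and reused for every block and section.
  std::vector<uint64_t> scratch(kBlockRows);
  for (std::size_t begin = 0; begin < rowCount; begin += kBlockRows) {
    const std::size_t count = std::min(kBlockRows, rowCount - begin);
    for (std::size_t s = 0; s < sections.size(); ++s) {
      // Stand-in for decoding `count` rows of section s into scratch.
      std::copy(
          sections[s].values.begin() + begin,
          sections[s].values.begin() + begin + count,
          scratch.begin());
      const uint32_t shift = sections[s].shift;
      if (s == 0) {
        // The first section assigns, so no upfront std::fill is needed.
        for (std::size_t i = 0; i < count; ++i) {
          output[begin + i] = scratch[i] << shift;
        }
      } else {
        for (std::size_t i = 0; i < count; ++i) {
          output[begin + i] |= scratch[i] << shift;
        }
      }
    }
  }
}
```

Because every section is applied to one block before moving to the next, the output block and scratch buffer stay hot in cache while they are being stitched.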
…or use in E2E filter tests
…reduce L1/L2 thrashing, add a persistent scratch buffer to avoid reallocation, include AVX2 paths and prefetch hints, and use first-section assignment instead of OR to avoid upfront std::fill cost
srsuryadev left a comment
LGTM for the initial version, later we can tune and add further optimization tricks to improve decode performance and storage efficiency.
We will merge this after the long weekend.
Thank you very much for the contribution, @David-C-L!
Summary
Adds SubIntSplitEncoding, a new encoding for 32- and 64-bit integer streams (including float/double via their physical bit patterns) that decomposes each value into bit-range sub-streams and encodes each independently.
The key insight is that structured integers (e.g. timestamps, IDs, packed bitfields) often have different compressibility in different bit regions. For example, a 64-bit Snowflake ID with a fixed 41-bit timestamp, an 11-bit datacenter field, and a 12-bit monotone counter can encode each section differently (e.g. dictionary on the datacenter region; delta on the monotone counter region). Crucially, splitting only at the exact, known field boundaries can be insufficient: depending on the data stream, a timestamp's upper bits can have very little variance and may benefit from a prefix encoding, while its lower bits may vary a lot but be roughly sorted, and so benefit from a FOR encoding.
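The decomposition described above can be sketched as follows. This is an illustrative example only: `BitRange`, `splitStreams`, and `stitchStreams` are hypothetical names, the field layout (counter in the low 12 bits, datacenter in the next 11, timestamp in the top 41) is an assumption, and the per-stream encoding step is omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one bit range: bits [shift, shift + width).
struct BitRange {
  uint32_t shift; // must be < 64
  uint32_t width; // must be >= 1
};

inline uint64_t maskOf(uint32_t width) {
  return width >= 64 ? ~uint64_t{0} : ((uint64_t{1} << width) - 1);
}

// Produce one right-aligned sub-stream per bit range.
std::vector<std::vector<uint64_t>> splitStreams(
    const std::vector<uint64_t>& values,
    const std::vector<BitRange>& ranges) {
  std::vector<std::vector<uint64_t>> streams(ranges.size());
  for (std::size_t r = 0; r < ranges.size(); ++r) {
    streams[r].reserve(values.size());
    for (uint64_t v : values) {
      streams[r].push_back((v >> ranges[r].shift) & maskOf(ranges[r].width));
    }
  }
  return streams;
}

// Reassemble the original values by shifting each sub-stream back into place.
std::vector<uint64_t> stitchStreams(
    const std::vector<std::vector<uint64_t>>& streams,
    const std::vector<BitRange>& ranges) {
  const std::size_t rows = streams.empty() ? 0 : streams[0].size();
  std::vector<uint64_t> values(rows, 0);
  for (std::size_t r = 0; r < streams.size(); ++r) {
    for (std::size_t i = 0; i < rows; ++i) {
      values[i] |= streams[r][i] << ranges[r].shift;
    }
  }
  return values;
}
```

With ranges `{0, 12}`, `{12, 11}`, and `{23, 41}`, a stream of IDs from a single datacenter yields a constant middle sub-stream, which is the kind of region where a dictionary or constant encoding would be a natural choice.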
Core components:
Integration:
Compilation fixes (required to build the new code under gcc/clang):
Prior benchmarks
TO BE ADDED SOON
Coming updates