Add per-chunk statistics computation and serialization in Nimble #719
Open
kewang1024 wants to merge 1 commit into
Summary:
CONTEXT: Nimble lacks per-chunk column statistics for filter pushdown. Currently, filters on non-indexed columns require reading and decoding every chunk within a stripe.
WHAT: This diff adds the write-path infrastructure for per-chunk min/max statistics using a Parquet-inspired binary design:
- ChunkStats struct with raw-byte min/max values (PLAIN-encoded in the stream's native type) + hasNulls flag + nullCount, making the schema type-agnostic
- ChunkIndex.fbs extended with a StreamChunkStats table using min_values/max_values [ubyte] + value_size + has_nulls + null_counts
- Stats computed during StreamChunker::next() via std::minmax_element on the exact chunk slice, with NaN-aware handling for floats
- ChunkIndexWriter serializes per-chunk stats into the chunk index alongside position data
- NimbleConfig: added the nimble.chunk.stats.enabled DDL property and wiring in NimbleWriterOptionBuilder
- Unit tests for the chunk stats round-trip and direct computeChunkStats tests
New FlatBuffer fields are optional, so old readers ignore them and old files work without them.
Differential Revision: D104712551
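To make the stats computation step concrete, here is a minimal sketch of computing min/max over one chunk slice with std::minmax_element and storing the results as raw bytes of the native type. It is not the code in this diff: `ChunkStatsSketch`, `toRawBytes`, `fromRawBytes`, and `computeChunkStatsSketch` are hypothetical names, and the NaN policy shown (emit no stats when a float chunk contains any NaN) is an assumption, since the summary only says the handling is NaN-aware.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the ChunkStats struct described in the summary:
// min/max are kept as raw bytes so the container does not depend on the type.
struct ChunkStatsSketch {
  std::vector<uint8_t> minBytes;
  std::vector<uint8_t> maxBytes;
  uint32_t valueSize{0};
  bool hasNulls{false};
  uint64_t nullCount{0};
};

// Copies a fixed-width value into a byte vector (host byte order).
template <typename T>
std::vector<uint8_t> toRawBytes(T value) {
  std::vector<uint8_t> out(sizeof(T));
  std::memcpy(out.data(), &value, sizeof(T));
  return out;
}

// Inverse of toRawBytes; the caller must supply the matching type.
template <typename T>
T fromRawBytes(const std::vector<uint8_t>& bytes) {
  T value{};
  std::memcpy(&value, bytes.data(), sizeof(T));
  return value;
}

// Computes stats over exactly the non-null values of one chunk.
// Returns std::nullopt when no usable min/max exists.
template <typename T>
std::optional<ChunkStatsSketch> computeChunkStatsSketch(
    const T* values, size_t size, uint64_t nullCount) {
  static_assert(std::is_arithmetic_v<T>, "fixed-width numeric types only");
  if (size == 0) {
    return std::nullopt;
  }
  if constexpr (std::is_floating_point_v<T>) {
    // Assumed policy: NaN breaks the ordering that std::minmax_element
    // relies on, so a chunk containing NaN gets no stats at all.
    if (std::any_of(values, values + size, [](T v) { return std::isnan(v); })) {
      return std::nullopt;
    }
  }
  auto [minIt, maxIt] = std::minmax_element(values, values + size);
  ChunkStatsSketch stats;
  stats.minBytes = toRawBytes(*minIt);
  stats.maxBytes = toRawBytes(*maxIt);
  stats.valueSize = sizeof(T);
  stats.hasNulls = nullCount > 0;
  stats.nullCount = nullCount;
  return stats;
}
```

A reader that knows the stream's native type can decode the bounds with `fromRawBytes<T>` and skip a chunk whose [min, max] range cannot satisfy a filter. Treating missing stats as "cannot prune" keeps that decision conservative.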
@kewang1024 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104712551.