Add per-chunk statistics computation and serialization in Nimble #719

Open
kewang1024 wants to merge 1 commit into facebookincubator:main from kewang1024:export-D104712551

@kewang1024
Contributor

Summary:
CONTEXT: Nimble lacks per-chunk column statistics for filter pushdown. Currently, filters on non-indexed columns require reading and decoding every chunk within a stripe.

WHAT: This diff adds the write-path infrastructure for per-chunk min/max statistics using a Parquet-inspired binary design:

  • ChunkStats struct with raw-byte min/max values (PLAIN-encoded in the stream's native type) + hasNulls flag + nullCount, making the schema type-agnostic
  • ChunkIndex.fbs extended with StreamChunkStats table using min_values/max_values [ubyte] + value_size + has_nulls + null_counts
  • Stats computed during StreamChunker::next() via std::minmax_element on the exact chunk slice, with NaN-aware handling for floats
  • ChunkIndexWriter serializes per-chunk stats into the chunk index alongside position data
  • NimbleConfig: added nimble.chunk.stats.enabled DDL property and wiring in NimbleWriterOptionBuilder
  • Unit tests for chunk stats round-trip and direct computeChunkStats tests
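
The computation described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this diff: `ChunkStatsSketch`, `computeChunkStatsSketch`, and `decodeValue` are hypothetical names, the field layout is assumed from the bullet text, and the NaN policy shown here (omit min/max when the slice contains any NaN) is only one possible reading of "NaN-aware handling".

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for ChunkStats; the real struct may differ.
struct ChunkStatsSketch {
  std::vector<uint8_t> min; // raw bytes of the minimum, in the native type
  std::vector<uint8_t> max; // raw bytes of the maximum, in the native type
  uint32_t valueSize{0};    // width in bytes of one value
  bool hasNulls{false};
  uint64_t nullCount{0};

  // Empty byte vectors mean "no min/max recorded for this chunk".
  bool hasMinMax() const {
    return !min.empty() && !max.empty();
  }
};

// Computes stats over exactly the non-null values of one chunk slice.
template <typename T>
ChunkStatsSketch computeChunkStatsSketch(
    const T* values, std::size_t count, uint64_t nullCount) {
  static_assert(std::is_arithmetic_v<T>, "sketch covers numeric streams only");
  ChunkStatsSketch stats;
  stats.valueSize = sizeof(T);
  stats.nullCount = nullCount;
  stats.hasNulls = nullCount > 0;
  if (count == 0) {
    return stats; // nothing to compare; leave min/max empty
  }
  if constexpr (std::is_floating_point_v<T>) {
    // Assumed policy: NaN breaks operator< ordering, so skip min/max
    // entirely when any value in the slice is NaN.
    const bool anyNaN = std::any_of(
        values, values + count, [](T v) { return std::isnan(v); });
    if (anyNaN) {
      return stats;
    }
  }
  const auto [minIt, maxIt] = std::minmax_element(values, values + count);
  stats.min.resize(sizeof(T));
  stats.max.resize(sizeof(T));
  std::memcpy(stats.min.data(), minIt, sizeof(T));
  std::memcpy(stats.max.data(), maxIt, sizeof(T));
  return stats;
}

// Reinterprets stored raw bytes as a value of the native type.
template <typename T>
T decodeValue(const std::vector<uint8_t>& bytes) {
  T value;
  std::memcpy(&value, bytes.data(), sizeof(T));
  return value;
}
```

Storing the bounds as raw bytes plus a value size is what keeps the stats container independent of the column type; only the code that computes or interprets the bytes needs to know `T`.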

New FlatBuffer fields are optional so old readers ignore them and old files work without them.
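
For illustration only, here is how a reader might treat the optional stats when evaluating a range filter. This diff covers the write path, so everything below is an assumption: `ChunkRange`, `encodeValue`, and `canSkipChunk` are hypothetical names, and an absent `std::optional` stands in for a file written before the new fields existed.

```cpp
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical decoded view of one chunk's min/max bytes.
struct ChunkRange {
  std::vector<uint8_t> min;
  std::vector<uint8_t> max;
};

// Helper for building raw-byte bounds from a typed value.
template <typename T>
std::vector<uint8_t> encodeValue(T value) {
  std::vector<uint8_t> bytes(sizeof(T));
  std::memcpy(bytes.data(), &value, sizeof(T));
  return bytes;
}

// Returns true only when the stats prove that no value in the chunk can
// satisfy lo <= x <= hi. Missing or malformed stats never allow a skip,
// so files without stats are read exactly as before.
template <typename T>
bool canSkipChunk(const std::optional<ChunkRange>& stats, T lo, T hi) {
  if (!stats.has_value() || stats->min.size() != sizeof(T) ||
      stats->max.size() != sizeof(T)) {
    return false; // no usable stats: fall back to decoding the chunk
  }
  T chunkMin;
  T chunkMax;
  std::memcpy(&chunkMin, stats->min.data(), sizeof(T));
  std::memcpy(&chunkMax, stats->max.data(), sizeof(T));
  return chunkMax < lo || chunkMin > hi;
}
```

The conservative default matters for compatibility: treating absent stats as "cannot skip" means a reader gives the same results on old and new files, and the stats only ever reduce work.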

Differential Revision: D104712551

@meta-cla meta-cla Bot added the CLA Signed label May 12, 2026
@meta-codesync
meta-codesync Bot commented May 12, 2026

@kewang1024 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104712551.

Labels

CLA Signed, fb-exported, meta-exported
