Add per-chunk statistics computation and serialization in Nimble by kewang1024 · Pull Request #719 · facebookincubator/nimble

kewang1024 · 2026-05-12T20:57:34Z

Summary:
CONTEXT: Nimble lacks per-chunk column statistics for filter pushdown. Currently, filters on non-indexed columns require reading and decoding every chunk within a stripe.
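
To make the motivation concrete, here is a minimal sketch of how a reader might eventually use per-chunk min/max to skip chunks for a range filter. This diff covers only the write path, so everything below is illustrative: `ChunkRange`, `mayMatch`, and `chunksToRead` are hypothetical names, not Nimble APIs, and the sketch assumes an int64 column with an inclusive filter range.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoded min/max for one chunk of an int64 column.
struct ChunkRange {
  int64_t min;
  int64_t max;
};

// Returns true when the chunk may contain a value in [lo, hi].
// A false result means the chunk can be skipped without decoding it.
inline bool mayMatch(const ChunkRange& chunk, int64_t lo, int64_t hi) {
  return chunk.max >= lo && chunk.min <= hi;
}

// Collects the indices of chunks that still have to be read and decoded.
inline std::vector<size_t> chunksToRead(
    const std::vector<ChunkRange>& chunks, int64_t lo, int64_t hi) {
  std::vector<size_t> selected;
  for (size_t i = 0; i < chunks.size(); ++i) {
    if (mayMatch(chunks[i], lo, hi)) {
      selected.push_back(i);
    }
  }
  return selected;
}
```

Without stats, the equivalent of `chunksToRead` has to return every chunk in the stripe.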

WHAT: This diff adds the write-path infrastructure for per-chunk min/max statistics using a Parquet-inspired binary design:

- ChunkStats struct with raw-byte min/max values (PLAIN-encoded in the stream native type) + hasNulls flag + nullCount, making the schema type-agnostic
- ChunkIndex.fbs extended with StreamChunkStats table using min_values/max_values [ubyte] + value_size + has_nulls + null_counts
- Stats computed during StreamChunker::next() via std::minmax_element on the exact chunk slice, with NaN-aware handling for floats
- ChunkIndexWriter serializes per-chunk stats into the chunk index alongside position data
- NimbleConfig: added nimble.chunk.stats.enabled DDL property and wiring in NimbleWriterOptionBuilder
- Unit tests for chunk stats round-trip and direct computeChunkStats tests
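
The stats computation described above can be sketched roughly as follows. This is not the code in the diff: the struct layout, the `computeChunkStatsSketch`, `toBytes`, and `fromBytes` names, and the `hasMinMax` flag are assumptions made for illustration. In particular, the NaN policy shown here (ignore NaNs, and emit no min/max when a chunk holds only NaNs) is one plausible reading of "NaN-aware"; the actual behavior may differ. Bytes are copied in host order, which matches a little-endian PLAIN layout only on little-endian machines.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the ChunkStats struct named in the summary.
struct ChunkStats {
  std::string minBytes;   // raw bytes of the minimum, in the stream's native type
  std::string maxBytes;   // raw bytes of the maximum
  uint32_t valueSize{0};  // width of one value in bytes
  bool hasNulls{false};
  uint64_t nullCount{0};
  bool hasMinMax{false};  // assumption: false when no comparable value exists
};

// Copies a fixed-width value into a byte string (host byte order).
template <typename T>
std::string toBytes(T value) {
  static_assert(std::is_trivially_copyable_v<T>);
  return std::string(reinterpret_cast<const char*>(&value), sizeof(T));
}

// Inverse of toBytes, used by a reader or a test to decode a stat.
template <typename T>
T fromBytes(const std::string& bytes) {
  assert(bytes.size() == sizeof(T));
  T value;
  std::memcpy(&value, bytes.data(), sizeof(T));
  return value;
}

// Computes stats over the non-null values of exactly one chunk slice.
template <typename T>
ChunkStats computeChunkStatsSketch(
    const std::vector<T>& values, uint64_t nullCount) {
  static_assert(std::is_arithmetic_v<T>, "sketch covers numeric streams only");
  ChunkStats stats;
  stats.valueSize = sizeof(T);
  stats.nullCount = nullCount;
  stats.hasNulls = nullCount > 0;

  // NaN breaks the strict weak ordering std::minmax_element relies on,
  // so NaNs are dropped before comparing (assumed policy).
  std::vector<T> comparable;
  comparable.reserve(values.size());
  for (T v : values) {
    if constexpr (std::is_floating_point_v<T>) {
      if (std::isnan(v)) {
        continue;
      }
    }
    comparable.push_back(v);
  }
  if (comparable.empty()) {
    return stats;  // empty or all-NaN chunk: no min/max recorded
  }

  auto [minIt, maxIt] =
      std::minmax_element(comparable.begin(), comparable.end());
  stats.minBytes = toBytes(*minIt);
  stats.maxBytes = toBytes(*maxIt);
  stats.hasMinMax = true;
  return stats;
}
```

Storing min/max as raw bytes plus a value size is what lets a single schema table serve every fixed-width type; the reader reinterprets the bytes using the stream's own type.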

New FlatBuffer fields are optional so old readers ignore them and old files work without them.

Differential Revision: D104712551

meta-codesync · 2026-05-12T20:57:42Z

@kewang1024 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104712551.

meta-cla Bot added the CLA Signed label May 12, 2026

meta-codesync Bot added fb-exported meta-exported labels May 12, 2026

kewang1024 wants to merge 1 commit into facebookincubator:main from kewang1024:export-D104712551
