feat: Fix FlatMapFieldWriter::ingestRow empty-map round-trip (T271900360) (#754)#754
Open
apurva-meta wants to merge 1 commit into
Open
feat: Fix FlatMapFieldWriter::ingestRow empty-map round-trip (T271900360) (#754)#754apurva-meta wants to merge 1 commit into
apurva-meta wants to merge 1 commit into
Conversation
|
@apurva-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105712499. |
apurva-meta
added a commit
to apurva-meta/nimble
that referenced
this pull request
May 19, 2026
…360) (facebookincubator#754) Summary: Stacks on top of D105703824 to land the primary fix for T271900360. Problem: FlatMapFieldWriter::ingestRow re-encodes a ROW vector (the struct projection a flatmap-as-struct reader produces) back into a flatmap. The pre-fix path unconditionally invoked the no-inMap overload of FlatMapPassthroughValueFieldWriter::write, which sets inMap=1 for every key at every non-null parent row. Result: a (non-null empty map) row becomes (non-null map of N NULL-valued entries) on disk; the round-trip is lossy and Vader's checksum validation flags it. Fix: For each feature column, compute a per-row inMap bitmap from the feature's null pattern (NULL value → inMap=0, non-null → inMap=1) and route through the existing 3-arg with-inMap overload. Empty/missing keys are now correctly encoded as 'key absent at this row'. Stats accounting lockstep: The writer's consistency check enforces fileRawSize == columnStats.front().getLogicalSize(). To keep that invariant intact under the new per-feature-presence accounting: - FieldWriter.cpp ingestRow pre-computes totalKeyCount as the sum of per-feature non-null entries within childRanges and passes it to collectKeyStatistics / collectPassthroughStringKeyStatistics. - collectPassthroughStringKeyStatistics takes the per-feature values + childRanges and computes the per-key string-length contribution scoped to present entries only. - RawSizeUtils.cpp getRawSizeForPassthroughFlatMap counts only present (non-null per-feature) entries for both keys and values. Test changes: - dwio/tools/test/TestFormatConverterFlatmapAsStruct.cpp adds EmptyMapRowsPreservedThroughIngestRow, a self-contained repro that builds a ROW projection of an empty-map row, feeds it through the flatmap writer, and asserts the as-MAP checksum matches a reference written from a MapVector. Fails pre-fix, passes post-fix. - dwio/nimble/velox/tests/FieldWriterStatsTest.cpp: the three pre-existing flatMapPassThrough*FieldWriterStats tests encoded the old 'all keys present at every non-null flatmap row' accounting via magic numerical stats. Their data setup is preserved, but their assertions are restructured to check semantic invariants that the new accounting guarantees: (1) valueStat.nullCount == 0 (NULL per-feature values become inMap=0, not value=NULL) (2) keyStat.valueCount == valueStat.valueCount (each present key has exactly one written value) (3) keyStat.valueCount <= numKeys * numNonNullFlatmapRows (upper bound from pre-fix accounting) (4) keyStat.valueCount > 0 These invariants are sufficient to detect regression while not encoding encoding-dependent byte/physical-size formulas. Differential Revision: D105712499
53cff60 to
5a4ffe0
Compare
…360) (facebookincubator#754) Summary: Stacks on top of D105703824 to land the primary fix for T271900360. Problem: FlatMapFieldWriter::ingestRow re-encodes a ROW vector (the struct projection a flatmap-as-struct reader produces) back into a flatmap. The pre-fix path unconditionally invoked the no-inMap overload of FlatMapPassthroughValueFieldWriter::write, which sets inMap=1 for every key at every non-null parent row. Result: a (non-null empty map) row becomes (non-null map of N NULL-valued entries) on disk; the round-trip is lossy and Vader's checksum validation flags it. Fix: For each feature column, compute a per-row inMap bitmap from the feature's null pattern (NULL value → inMap=0, non-null → inMap=1) and route through the existing 3-arg with-inMap overload. Empty/missing keys are now correctly encoded as 'key absent at this row'. Stats accounting lockstep: The writer's consistency check enforces fileRawSize == columnStats.front().getLogicalSize(). To keep that invariant intact under the new per-feature-presence accounting: - FieldWriter.cpp ingestRow pre-computes totalKeyCount as the sum of per-feature non-null entries within childRanges and passes it to collectKeyStatistics / collectPassthroughStringKeyStatistics. - collectPassthroughStringKeyStatistics takes the per-feature values + childRanges and computes the per-key string-length contribution scoped to present entries only. - RawSizeUtils.cpp getRawSizeForPassthroughFlatMap counts only present (non-null per-feature) entries for both keys and values. Test changes: - dwio/tools/test/TestFormatConverterFlatmapAsStruct.cpp adds EmptyMapRowsPreservedThroughIngestRow, a self-contained repro that builds a ROW projection of an empty-map row, feeds it through the flatmap writer, and asserts the as-MAP checksum matches a reference written from a MapVector. Fails pre-fix, passes post-fix. - dwio/nimble/velox/tests/FieldWriterStatsTest.cpp: the three pre-existing flatMapPassThrough*FieldWriterStats tests encoded the old 'all keys present at every non-null flatmap row' accounting via magic numerical stats. Their data setup is preserved, but their assertions are restructured to check semantic invariants that the new accounting guarantees: (1) valueStat.nullCount == 0 (NULL per-feature values become inMap=0, not value=NULL) (2) keyStat.valueCount == valueStat.valueCount (each present key has exactly one written value) (3) keyStat.valueCount <= numKeys * numNonNullFlatmapRows (upper bound from pre-fix accounting) (4) keyStat.valueCount > 0 These invariants are sufficient to detect regression while not encoding encoding-dependent byte/physical-size formulas. Differential Revision: D105712499
5a4ffe0 to
6a562b2
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
Stacks on top of D105703824 to land the primary fix for T271900360.
Problem:
FlatMapFieldWriter::ingestRow re-encodes a ROW vector (the struct
projection a flatmap-as-struct reader produces) back into a flatmap.
The pre-fix path unconditionally invoked the no-inMap overload of
FlatMapPassthroughValueFieldWriter::write, which sets inMap=1 for
every key at every non-null parent row. Result: a (non-null empty map)
row becomes (non-null map of N NULL-valued entries) on disk; the
round-trip is lossy and Vader's checksum validation flags it.
Fix:
For each feature column, compute a per-row inMap bitmap from the
feature's null pattern (NULL value → inMap=0, non-null → inMap=1) and
route through the existing 3-arg with-inMap overload. Empty/missing
keys are now correctly encoded as 'key absent at this row'.
Stats accounting lockstep:
The writer's consistency check enforces
fileRawSize == columnStats.front().getLogicalSize(). To keep that
invariant intact under the new per-feature-presence accounting:
sum of per-feature non-null entries within childRanges and
passes it to collectKeyStatistics / collectPassthroughStringKeyStatistics.
values + childRanges and computes the per-key string-length
contribution scoped to present entries only.
present (non-null per-feature) entries for both keys and values.
Test changes:
EmptyMapRowsPreservedThroughIngestRow, a self-contained repro
that builds a ROW projection of an empty-map row, feeds it
through the flatmap writer, and asserts the as-MAP checksum
matches a reference written from a MapVector. Fails pre-fix,
passes post-fix.
pre-existing flatMapPassThrough*FieldWriterStats tests encoded
the old 'all keys present at every non-null flatmap row'
accounting via magic numerical stats. Their data setup is
preserved, but their assertions are restructured to check
semantic invariants that the new accounting guarantees:
(1) valueStat.nullCount == 0
(NULL per-feature values become inMap=0, not value=NULL)
(2) keyStat.valueCount == valueStat.valueCount
(each present key has exactly one written value)
(3) keyStat.valueCount <= numKeys * numNonNullFlatmapRows
(upper bound from pre-fix accounting)
(4) keyStat.valueCount > 0
These invariants are sufficient to detect regression while not
encoding encoding-dependent byte/physical-size formulas.
Differential Revision: D105712499