Skip to content

feat: Fix FlatMapFieldWriter::ingestRow empty-map round-trip (T271900360) (#754)#754

Open
apurva-meta wants to merge 1 commit into
facebookincubator:mainfrom
apurva-meta:export-D105712499
Open

feat: Fix FlatMapFieldWriter::ingestRow empty-map round-trip (T271900360) (#754)#754
apurva-meta wants to merge 1 commit into
facebookincubator:mainfrom
apurva-meta:export-D105712499

Conversation

@apurva-meta
Copy link
Copy Markdown
Contributor

@apurva-meta apurva-meta commented May 19, 2026

Summary:

Stacks on top of D105703824 to land the primary fix for T271900360.

Problem:
FlatMapFieldWriter::ingestRow re-encodes a ROW vector (the struct
projection a flatmap-as-struct reader produces) back into a flatmap.
The pre-fix path unconditionally invoked the no-inMap overload of
FlatMapPassthroughValueFieldWriter::write, which sets inMap=1 for
every key at every non-null parent row. Result: a (non-null empty map)
row becomes (non-null map of N NULL-valued entries) on disk; the
round-trip is lossy and Vader's checksum validation flags it.

Fix:
For each feature column, compute a per-row inMap bitmap from the
feature's null pattern (NULL value → inMap=0, non-null → inMap=1) and
route through the existing 3-arg with-inMap overload. Empty/missing
keys are now correctly encoded as 'key absent at this row'.

Stats accounting lockstep:
The writer's consistency check enforces
fileRawSize == columnStats.front().getLogicalSize(). To keep that
invariant intact under the new per-feature-presence accounting:

  • FieldWriter.cpp ingestRow pre-computes totalKeyCount as the
    sum of per-feature non-null entries within childRanges and
    passes it to collectKeyStatistics / collectPassthroughStringKeyStatistics.
  • collectPassthroughStringKeyStatistics takes the per-feature
    values + childRanges and computes the per-key string-length
    contribution scoped to present entries only.
  • RawSizeUtils.cpp getRawSizeForPassthroughFlatMap counts only
    present (non-null per-feature) entries for both keys and values.

Test changes:

  • dwio/tools/test/TestFormatConverterFlatmapAsStruct.cpp adds
    EmptyMapRowsPreservedThroughIngestRow, a self-contained repro
    that builds a ROW projection of an empty-map row, feeds it
    through the flatmap writer, and asserts the as-MAP checksum
    matches a reference written from a MapVector. Fails pre-fix,
    passes post-fix.
  • dwio/nimble/velox/tests/FieldWriterStatsTest.cpp: the three
    pre-existing flatMapPassThrough*FieldWriterStats tests encoded
    the old 'all keys present at every non-null flatmap row'
    accounting via magic numerical stats. Their data setup is
    preserved, but their assertions are restructured to check
    semantic invariants that the new accounting guarantees:
    (1) valueStat.nullCount == 0
    (NULL per-feature values become inMap=0, not value=NULL)
    (2) keyStat.valueCount == valueStat.valueCount
    (each present key has exactly one written value)
    (3) keyStat.valueCount <= numKeys * numNonNullFlatmapRows
    (upper bound from pre-fix accounting)
    (4) keyStat.valueCount > 0
    These invariants are sufficient to detect regression while not
    encoding encoding-dependent byte/physical-size formulas.

Differential Revision: D105712499

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 19, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented May 19, 2026

@apurva-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105712499.

@meta-codesync meta-codesync Bot changed the title Fix FlatMapFieldWriter::ingestRow empty-map round-trip (T271900360) feat: Fix FlatMapFieldWriter::ingestRow empty-map round-trip (T271900360) (#754) May 19, 2026
apurva-meta added a commit to apurva-meta/nimble that referenced this pull request May 19, 2026
…360) (facebookincubator#754)

Summary:

Stacks on top of D105703824 to land the primary fix for T271900360.

Problem:
FlatMapFieldWriter::ingestRow re-encodes a ROW vector (the struct
projection a flatmap-as-struct reader produces) back into a flatmap.
The pre-fix path unconditionally invoked the no-inMap overload of
FlatMapPassthroughValueFieldWriter::write, which sets inMap=1 for
every key at every non-null parent row. Result: a (non-null empty map)
row becomes (non-null map of N NULL-valued entries) on disk; the
round-trip is lossy and Vader's checksum validation flags it.

Fix:
For each feature column, compute a per-row inMap bitmap from the
feature's null pattern (NULL value → inMap=0, non-null → inMap=1) and
route through the existing 3-arg with-inMap overload. Empty/missing
keys are now correctly encoded as 'key absent at this row'.

Stats accounting lockstep:
The writer's consistency check enforces
fileRawSize == columnStats.front().getLogicalSize(). To keep that
invariant intact under the new per-feature-presence accounting:

  - FieldWriter.cpp ingestRow pre-computes totalKeyCount as the
    sum of per-feature non-null entries within childRanges and
    passes it to collectKeyStatistics / collectPassthroughStringKeyStatistics.
  - collectPassthroughStringKeyStatistics takes the per-feature
    values + childRanges and computes the per-key string-length
    contribution scoped to present entries only.
  - RawSizeUtils.cpp getRawSizeForPassthroughFlatMap counts only
    present (non-null per-feature) entries for both keys and values.

Test changes:
  - dwio/tools/test/TestFormatConverterFlatmapAsStruct.cpp adds
    EmptyMapRowsPreservedThroughIngestRow, a self-contained repro
    that builds a ROW projection of an empty-map row, feeds it
    through the flatmap writer, and asserts the as-MAP checksum
    matches a reference written from a MapVector. Fails pre-fix,
    passes post-fix.
  - dwio/nimble/velox/tests/FieldWriterStatsTest.cpp: the three
    pre-existing flatMapPassThrough*FieldWriterStats tests encoded
    the old 'all keys present at every non-null flatmap row'
    accounting via magic numerical stats. Their data setup is
    preserved, but their assertions are restructured to check
    semantic invariants that the new accounting guarantees:
      (1) valueStat.nullCount == 0
          (NULL per-feature values become inMap=0, not value=NULL)
      (2) keyStat.valueCount == valueStat.valueCount
          (each present key has exactly one written value)
      (3) keyStat.valueCount <= numKeys * numNonNullFlatmapRows
          (upper bound from pre-fix accounting)
      (4) keyStat.valueCount > 0
    These invariants are sufficient to detect regression while not
    encoding encoding-dependent byte/physical-size formulas.

Differential Revision: D105712499
…360) (facebookincubator#754)

Summary:

Stacks on top of D105703824 to land the primary fix for T271900360.

Problem:
FlatMapFieldWriter::ingestRow re-encodes a ROW vector (the struct
projection a flatmap-as-struct reader produces) back into a flatmap.
The pre-fix path unconditionally invoked the no-inMap overload of
FlatMapPassthroughValueFieldWriter::write, which sets inMap=1 for
every key at every non-null parent row. Result: a (non-null empty map)
row becomes (non-null map of N NULL-valued entries) on disk; the
round-trip is lossy and Vader's checksum validation flags it.

Fix:
For each feature column, compute a per-row inMap bitmap from the
feature's null pattern (NULL value → inMap=0, non-null → inMap=1) and
route through the existing 3-arg with-inMap overload. Empty/missing
keys are now correctly encoded as 'key absent at this row'.

Stats accounting lockstep:
The writer's consistency check enforces
fileRawSize == columnStats.front().getLogicalSize(). To keep that
invariant intact under the new per-feature-presence accounting:

  - FieldWriter.cpp ingestRow pre-computes totalKeyCount as the
    sum of per-feature non-null entries within childRanges and
    passes it to collectKeyStatistics / collectPassthroughStringKeyStatistics.
  - collectPassthroughStringKeyStatistics takes the per-feature
    values + childRanges and computes the per-key string-length
    contribution scoped to present entries only.
  - RawSizeUtils.cpp getRawSizeForPassthroughFlatMap counts only
    present (non-null per-feature) entries for both keys and values.

Test changes:
  - dwio/tools/test/TestFormatConverterFlatmapAsStruct.cpp adds
    EmptyMapRowsPreservedThroughIngestRow, a self-contained repro
    that builds a ROW projection of an empty-map row, feeds it
    through the flatmap writer, and asserts the as-MAP checksum
    matches a reference written from a MapVector. Fails pre-fix,
    passes post-fix.
  - dwio/nimble/velox/tests/FieldWriterStatsTest.cpp: the three
    pre-existing flatMapPassThrough*FieldWriterStats tests encoded
    the old 'all keys present at every non-null flatmap row'
    accounting via magic numerical stats. Their data setup is
    preserved, but their assertions are restructured to check
    semantic invariants that the new accounting guarantees:
      (1) valueStat.nullCount == 0
          (NULL per-feature values become inMap=0, not value=NULL)
      (2) keyStat.valueCount == valueStat.valueCount
          (each present key has exactly one written value)
      (3) keyStat.valueCount <= numKeys * numNonNullFlatmapRows
          (upper bound from pre-fix accounting)
      (4) keyStat.valueCount > 0
    These invariants are sufficient to detect regression while not
    encoding encoding-dependent byte/physical-size formulas.

Differential Revision: D105712499
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant