feat: Lazy per-column I/O for complex columns in Nimble (#677) #677
Open
prashantgolash wants to merge 1 commit into
@prashantgolash has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100277342.
681e874 to 143d35b
prashantgolash
added a commit
to prashantgolash/velox
that referenced
this pull request
Apr 27, 2026
…ebookincubator#17350)

Summary:
X-link: facebookincubator/nimble#677

FlatMap columns (e.g. sparse_features) store each map key as separate streams, often hundreds of streams totaling GBs per stripe. Today, all streams are loaded eagerly during stripe setup, even for columns wrapped in LazyVectors. When a high-selectivity filter on a sibling column (e.g. element_at(pipeline_labels, key) IS NOT NULL with 99.98% selectivity) eliminates most rows, the FlatMap data is loaded but never used.

This diff implements per-column deferred I/O, gated behind the defer_flatmap_io session property (default off).

## How it works

**Before (eager):** All streams are loaded in one batched I/O during stripe setup. FlatMap data sits in memory even if the filter eliminates every row.

**After (deferred):** Each qualifying FlatMap column gets its own cloned BufferedInput. Its streams are enqueued but not loaded during stripe setup. On first lazy access, DeferredInput::load() issues a single batched I/O for all of that column's streams. If the filter eliminates all rows in a stripe, the load is never triggered: zero I/O for that column.

## What qualifies for deferral

A column is deferred when all of these are true:

- defer_flatmap_io session property is enabled
- Column is a top-level child of the root struct (eligible for LazyVector)
- At least one sibling has a pushed-down filter
- The column itself has no filter and is projected
- The column is a complex type (MAP, ARRAY, or ROW)

## Batch size estimation

Deferred columns' decoders are not loaded, so estimateMaterializedSize() cannot query them. Without handling this, the estimate fails and falls back to 1MB per row (tiny batches, massive overhead). The fix: skip deferred children and use their totalStreamBytes (compressed stream sizes from tablet metadata) as an approximation. When file-level vectorized stats exist, this code path is never reached; stats-based estimation (Tier 1) wins outright.

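The eligibility rules listed under "What qualifies for deferral" can be read as a single predicate. The following is a minimal Python sketch of that reading, not the Velox/Nimble C++ code; `ColumnSpec`, `should_defer`, and every field name are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass

# Complex types named in the commit message.
COMPLEX_TYPES = {"MAP", "ARRAY", "ROW"}


@dataclass
class ColumnSpec:
    """Hypothetical description of one column in the scan."""
    name: str
    type_kind: str      # e.g. "MAP", "BIGINT"
    is_top_level: bool  # direct child of the root struct
    has_filter: bool    # a filter is pushed down onto this column
    is_projected: bool  # the column appears in the output


def should_defer(column, siblings, defer_flatmap_io):
    """Return True only when all five listed conditions hold."""
    return bool(
        defer_flatmap_io
        and column.is_top_level
        and any(s.has_filter for s in siblings)
        and not column.has_filter
        and column.is_projected
        and column.type_kind in COMPLEX_TYPES
    )
```

For example, a projected, unfiltered MAP column next to a filtered BIGINT sibling would qualify under this sketch, while the filtered sibling itself would not.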
## Why per-column clones (not a shared clone)

Each deferred column gets its own cloned BufferedInput rather than sharing one clone across all deferred columns. A shared clone would preserve cross-column coalescing but has a critical flaw: when the remaining filter accesses one deferred column (e.g. pipeline_labels for element_at), the shared load() triggers I/O for ALL deferred columns, including output-only columns (e.g. sparse_features) that may never be needed if the remaining filter eliminates all rows.

Per-column clones load each column independently at the right time:

- pipeline_labels loads when the remaining filter accesses it
- sparse_features loads only when serialization needs it (after the remaining filter)
- If the remaining filter eliminates all rows, sparse_features is never loaded

Production validation confirmed: shared clone showed no I/O reduction (46TB vs 46TB), while per-column clones reduced storageRead from 46TB to 6TB (7.5x reduction).

## Usage

    SET SESSION hive.native_defer_flatmap_io = true;

Differential Revision: D100277342
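The shared-clone versus per-column-clone tradeoff described in this commit message can be illustrated with a toy model. This is a hedged Python sketch, not the actual BufferedInput API: `FakeInput`, `run_stripe`, and the byte sizes are hypothetical stand-ins chosen for the example.

```python
class FakeInput:
    """Stand-in for a cloned input: streams are enqueued, then read in one batch."""

    def __init__(self):
        self.pending = []    # (column, size) pairs enqueued but not yet read
        self.loaded = False
        self.bytes_read = 0

    def enqueue(self, column, size):
        self.pending.append((column, size))

    def load(self):
        # One batched read covering everything enqueued on this input.
        if not self.loaded:
            self.bytes_read = sum(size for _, size in self.pending)
            self.loaded = True


def run_stripe(columns, shared, rows_after_filter):
    """Return bytes read for the deferred columns of one stripe.

    columns maps column name to stream bytes; shared selects one clone
    for all columns instead of one clone per column.
    """
    if shared:
        one = FakeInput()
        inputs = {name: one for name in columns}
    else:
        inputs = {name: FakeInput() for name in columns}
    for name, size in columns.items():
        inputs[name].enqueue(name, size)

    # The remaining filter touches pipeline_labels first.
    inputs["pipeline_labels"].load()
    # The output-only column is touched only if some rows survive.
    if rows_after_filter > 0:
        inputs["sparse_features"].load()

    distinct = {id(i): i for i in inputs.values()}
    return sum(i.bytes_read for i in distinct.values())
```

With `{"pipeline_labels": 10, "sparse_features": 1000}` and no surviving rows, the shared variant reads both columns as soon as the filter touches `pipeline_labels`, while the per-column variant reads only `pipeline_labels`.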
143d35b to 5a29802
5a29802 to 0025f13
prashantgolash
added a commit
to prashantgolash/velox
that referenced
this pull request
Apr 28, 2026
prashantgolash
added a commit
to prashantgolash/nimble
that referenced
this pull request
Apr 28, 2026
0025f13 to a43fb66
prashantgolash
added a commit
to prashantgolash/velox
that referenced
this pull request
Apr 29, 2026
prashantgolash
added a commit
to prashantgolash/nimble
that referenced
this pull request
Apr 29, 2026
a43fb66 to 42cbb2f
prashantgolash
added a commit
to prashantgolash/velox
that referenced
this pull request
Apr 29, 2026
prashantgolash
added a commit
to prashantgolash/nimble
that referenced
this pull request
Apr 29, 2026
42cbb2f to d08b7b0
d08b7b0 to 164366f
prashantgolash
added a commit
to prashantgolash/velox
that referenced
this pull request
May 1, 2026
Summary:
X-link: facebookincubator/nimble#677

Today, the Nimble selective reader loads all column streams upfront during stripe init, including columns wrapped in LazyVectors. The lazy contract only defers decoding; the underlying I/O is still eager. When a high-selectivity remaining filter eliminates most rows, the eagerly-loaded data for output-only columns is never decoded, but the I/O cost was already paid.

This diff extends laziness from decoding to I/O. Complex lazy columns (MAP/ARRAY/ROW) without pushed-down filters get their streams enqueued into a per-column cloned BufferedInput, loaded only on first downstream access. If the filter eliminates all rows in a stripe, the deferred column's load() is never called: zero I/O for that column in that stripe.

How it works:

- During column reader construction, qualifying columns have their streams enqueued into a cloned BufferedInput instead of the shared one.
- The shared input is loaded during stripe init (eager columns only).
- Each deferred column's clone is loaded independently via ColumnLoader when the LazyVector is first accessed.
- Batch size estimation uses totalStreamBytes (compressed stream sizes from tablet metadata) for deferred columns since their decoders are not yet loaded.

Gated behind the `lazy_column_io` session property (default off).

Detailed analysis (naming changes, per-column vs shared clone tradeoff, code flow, shadow data): P2297779682

Differential Revision: D100277342
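The batch size estimation step mentioned in this commit message can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the reader's implementation: `Child`, `estimate_row_bytes`, and the choice to divide `total_stream_bytes` by the stripe row count are hypothetical, since the message only says that totalStreamBytes is used for deferred columns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Child:
    """Hypothetical per-column view used for the estimate."""
    name: str
    deferred: bool
    total_stream_bytes: int                        # compressed size from metadata
    decoded_bytes_per_row: Optional[float] = None  # known once decoders are loaded


def estimate_row_bytes(children, stripe_rows):
    """Approximate bytes per row for one stripe, or None if no estimate exists."""
    if stripe_rows <= 0:
        return None
    total = 0.0
    for child in children:
        if child.deferred:
            # Decoders are not loaded yet, so use compressed stream bytes
            # spread over the stripe's rows (an assumed conversion).
            total += child.total_stream_bytes / stripe_rows
        elif child.decoded_bytes_per_row is None:
            # An eager child without a decoder estimate: give up.
            return None
        else:
            total += child.decoded_bytes_per_row
    return total
```

In this sketch a deferred column never causes the whole estimate to fail; only an eager column without a decoder-based figure does.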
prashantgolash
added a commit
to prashantgolash/velox
that referenced
this pull request
May 4, 2026
164366f to 4fed353
prashantgolash
added a commit
to prashantgolash/nimble
that referenced
this pull request
May 4, 2026
4fed353 to 478ca3e
prashantgolash
added a commit
to prashantgolash/nimble
that referenced
this pull request
May 8, 2026
prashantgolash
added a commit
to prashantgolash/velox
that referenced
this pull request
May 8, 2026
…bator#17350)
Summary:
Pull Request resolved: facebookincubator#17350
X-link: facebookincubator/nimble#677
Today, the Nimble selective reader loads all column streams upfront during stripe init, including columns wrapped in LazyVectors. The lazy contract only defers decoding; the underlying I/O is still eager. When a high-selectivity remaining filter eliminates most rows, the eagerly-loaded data for output-only columns is never decoded, but the I/O cost was already paid.
This diff extends laziness from decoding to I/O. Complex lazy columns (MAP/ARRAY/ROW) without pushed-down filters get their streams enqueued into a per-column cloned BufferedInput, loaded only on first downstream access. If the filter eliminates all rows in a stripe, the deferred column's load() is never called: zero I/O for that column in that stripe.
How it works:
- During column reader construction, qualifying columns have their streams enqueued into a cloned BufferedInput instead of the shared one.
- The shared input is loaded during stripe init (eager columns only).
- Each deferred column's clone is loaded independently via ColumnLoader when the LazyVector is first accessed.
- Batch size estimation uses totalStreamBytes (compressed stream sizes from tablet metadata) for deferred columns since their decoders are not yet loaded.
Gated behind the `lazy_column_io` session property (default off).
Detailed analysis (naming changes, per-column vs shared clone tradeoff, code flow, shadow data): P2302893230
Differential Revision: D100277342
prashantgolash
added a commit
to prashantgolash/velox
that referenced
this pull request
May 12, 2026
478ca3e to
f6d53b5
Compare
Summary:
X-link: facebookincubator/velox#17350
Today, the Nimble selective reader loads all column streams upfront during stripe init — including columns wrapped in LazyVectors. The lazy contract only defers decoding; the underlying I/O is still eager. When a high-selectivity remaining filter eliminates most rows, the eagerly-loaded data for output-only columns is never decoded — but the I/O cost was already paid.
This diff extends laziness from decoding to I/O. Complex lazy columns (MAP/ARRAY/ROW) without pushed-down filters get their streams enqueued into a per-column cloned BufferedInput, loaded only on first downstream access. If the filter eliminates all rows in a stripe, the deferred column's load() is never called — zero I/O for that column in that stripe.
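The deferral described above can be illustrated with a small, self-contained sketch. It is not the Nimble or Velox code: `SketchInput`, `SketchLazyColumn`, and `readStripe` are hypothetical stand-ins for the shared BufferedInput, a deferred column holding its own clone, and stripe processing. The sketch only counts batched I/O calls to show that a stripe whose rows are all filtered out never triggers the deferred column's load.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a buffered input: streams are enqueued first,
// then fetched together in one batched read when load() is called.
class SketchInput {
 public:
  void enqueue(std::string stream) {
    // Sketch assumption: streams are only enqueued before the input is loaded.
    assert(!loaded_);
    pending_.push_back(std::move(stream));
  }

  // Issues one batched "read" for everything enqueued; later calls are no-ops.
  void load() {
    if (loaded_) {
      return;
    }
    loaded_ = true;
    ++ioCalls_;
  }

  int ioCalls() const { return ioCalls_; }

 private:
  std::vector<std::string> pending_;
  bool loaded_ = false;
  int ioCalls_ = 0;
};

// Hypothetical deferred column: it owns a private input (the "clone") and
// loads it only when a downstream consumer first asks for values.
class SketchLazyColumn {
 public:
  explicit SketchLazyColumn(std::vector<std::string> streams) {
    for (auto& s : streams) {
      input_.enqueue(std::move(s));
    }
  }

  void materialize() { input_.load(); }
  int ioCalls() const { return input_.ioCalls(); }

 private:
  SketchInput input_;
};

struct StripeResult {
  int sharedIoCalls;
  int deferredIoCalls;
};

// Simulates one stripe: the shared input (eager columns) is loaded during
// stripe init; the deferred column is touched only if rows survive the filter.
StripeResult readStripe(std::size_t rowsPassingFilter) {
  SketchInput shared;
  shared.enqueue("filter_col.values");
  SketchLazyColumn features({"features.key1", "features.key2"});

  shared.load();  // stripe init: eager columns only
  if (rowsPassingFilter > 0) {
    features.materialize();  // first downstream access triggers the I/O
  }
  return {shared.ioCalls(), features.ioCalls()};
}
```

In this sketch `readStripe(0)` performs one batched read for the shared input and none for the deferred column, while any non-zero row count adds exactly one batched read for the deferred column's streams.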
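How it works:
- During column reader construction, qualifying columns have their streams enqueued into a cloned BufferedInput instead of the shared one.
- The shared input is loaded during stripe init (eager columns only).
- Each deferred column's clone is loaded independently via ColumnLoader when the LazyVector is first accessed.
- Batch size estimation uses totalStreamBytes (compressed stream sizes from tablet metadata) for deferred columns since their decoders are not yet loaded.
Gated behind the `lazy_column_io` session property (default off).
Detailed analysis (naming changes, per-column vs shared clone tradeoff, code flow, shadow data): P2302893230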
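The batch size estimation step in the summary (deferred columns contribute their totalStreamBytes because their decoders are not loaded) might look roughly like the sketch below. The `SketchColumn` struct and `estimateBytesPerRow` function are hypothetical, and dividing the compressed stream bytes by the stripe row count is an assumption made for illustration, not something the summary specifies.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-column summary used only in this sketch.
struct SketchColumn {
  bool deferred;                               // I/O not yet issued
  std::optional<uint64_t> decodedBytesPerRow;  // known only once the decoder is loaded
  uint64_t totalStreamBytes;                   // compressed size from tablet metadata
};

// Returns an estimated row size in bytes, or nullopt when an eager column
// cannot be estimated. Deferred columns are never asked for decoder
// information; their compressed stream size serves as an approximation.
std::optional<uint64_t> estimateBytesPerRow(
    const std::vector<SketchColumn>& columns,
    uint64_t stripeRows) {
  if (stripeRows == 0) {
    return std::nullopt;
  }
  uint64_t total = 0;
  for (const auto& column : columns) {
    if (column.deferred) {
      // Assumption: spread the compressed bytes evenly over the stripe's rows.
      total += column.totalStreamBytes / stripeRows;
    } else if (column.decodedBytesPerRow.has_value()) {
      total += *column.decodedBytesPerRow;
    } else {
      return std::nullopt;  // eager column without a usable decoder estimate
    }
  }
  return total;
}
```

The point of the sketch is that a deferred column no longer makes the whole estimate fail; it only makes the estimate coarser, since compressed size can understate the materialized size.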
Reviewed By: HuamengJiang
Differential Revision: D100277342