feat[substrait]: translate scan statistics (row_count, record_size) via RelCommon hints#21112
Conversation
gabotechs
left a comment
There was a problem hiding this comment.
Really cool to see stats provided as hints in the Substrait plan!
| #[derive(Debug)] | ||
| struct StatisticsOverrideTableProvider { | ||
| inner: Arc<dyn TableProvider>, | ||
| statistics: Statistics, | ||
| } |
There was a problem hiding this comment.
An "override" or a "wrapper" is probably not the most future proof way of enhancing a node with statistics, it's a bit hacky for it to be committed here.
For example, this can mess with the downcast(), as the specific struct implementation of the TableProvider is now a different one.
There was a problem hiding this comment.
Done in the updated version. StatisticsOverrideTableProvider is still used by DefaultSubstraitConsumer as a fallback, but since the hint is now passed through resolve_table_ref, custom implementors can return a provider with statistics already embedded and avoid the wrapper entirely — which also sidesteps the downcast issue.
There was a problem hiding this comment.
I get the impression that a less opinionated approach would be to pass down the hints as another argument to consumer.resolve_table_ref(, and let implementors of resolve_table_ref decide wether they want to use the provided statistics or not in their specific TableProvider implementation.
There was a problem hiding this comment.
Implemented as suggested. resolve_table_ref now takes a row_count_hint: Option argument, letting each implementor decide what to do with it. The default consumer applies the wrapper when the provider has no statistics of its own; custom consumers can handle it however fits their TableProvider implementation.
There was a problem hiding this comment.
The final API evolved further: the parameter is now hints: SubstraitHints (a #[non_exhaustive] struct with row_count: Option<f64> and record_size: Option<f64>) rather than the intermediate row_count_hint: Option<f64>. The struct groups both hints together and allows new fields to be added in future versions without additional breaking changes.
There was a problem hiding this comment.
To close the loop: the final API has resolve_table_ref taking hints: SubstraitHints (a #[non_exhaustive] struct with row_count: Option<f64> and record_size: Option<f64>), rather than the intermediate row_count_hint: Option<f64> mentioned in my earlier reply. StatisticsOverrideTableProvider is private to DefaultSubstraitConsumer — custom implementors can ignore the hints or embed them directly into their own TableProvider. Happy to walk through anything before you re-review.
e204e72 to
b246002
Compare
a09ed90 to
f665d1e
Compare
Serialize `num_rows` and `total_byte_size` from a `TableProvider` into `RelCommon.hint.stats` on the producer side, and inject them back into the resolved provider on the consumer side when the provider has no statistics of its own. - Producer: writes `row_count` and `record_size` (= bytes / rows) into `RelCommon.hint.stats` for every `TableScan`. - Consumer: exposes a new `hints: SubstraitHints` parameter on `SubstraitConsumer::resolve_table_ref`; `DefaultSubstraitConsumer` wraps the provider in a `StatisticsOverrideTableProvider` when hints are present and the provider lacks the corresponding fields. - `SubstraitHints` is `#[non_exhaustive]` so future fields can be added without further breaking changes. - Hints are advisory: `Exact` statistics become `Inexact` after a round-trip. Tables with exactly 0 rows lose their stats (proto3 default-zero ambiguity — accepted limitation). Closes apache#21112 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
07736eb to
020fd13
Compare
Serialize `num_rows` and `total_byte_size` from a `TableProvider` into `RelCommon.hint.stats` on the producer side, and inject them back into the resolved provider on the consumer side when the provider has no statistics of its own. - Producer: writes `row_count` and `record_size` (= bytes / rows) into `RelCommon.hint.stats` for every `TableScan`. - Consumer: exposes a new `hints: SubstraitHints` parameter on `SubstraitConsumer::resolve_table_ref`; `DefaultSubstraitConsumer` wraps the provider in a `StatisticsOverrideTableProvider` when hints are present and the provider lacks the corresponding fields. - `SubstraitHints` is `#[non_exhaustive]` so future fields can be added without further breaking changes. - Hints are advisory: `Exact` statistics become `Inexact` after a round-trip. Tables with exactly 0 rows lose their stats (proto3 default-zero ambiguity — accepted limitation). - A `record_size`-only hint (no `row_count`) cannot reconstruct `total_byte_size` and is silently ignored; this is tested explicitly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
020fd13 to
5933e38
Compare
Which issue does this PR close?
Partially closes #8698. This PR implements
row_countandrecord_sizetranslation for scan nodes (ReadRel). Statistics for intermediate nodes are out of scope and left for a follow-up.Rationale for this change
Substrait's
RelCommon.hint.statscarries row count and record size statistics as advisory hints. DataFusion was not reading or writing these fields, meaning statistics were silently dropped when round-tripping logical plans through Substrait. Preserving them is useful for downstream optimizer rules that rely on row count estimates (e.g. join ordering).What changes are included in this PR?
Producer (
producer/rel/read_rel.rs): when serializing aTableScan, readsstatistics()from theTableProvider. Ifnum_rowsisExact(n)orInexact(n), it is written toRelCommon.hint.stats.row_count. Iftotal_byte_sizeis also available,record_size = total_byte_size / row_countis written toRelCommon.hint.stats.record_size.Consumer (
substrait_consumer.rs+consumer/rel/read_rel.rs): extractsrow_countandrecord_sizefromRelCommon.hint.statson anyReadReland passes them via a newhints: SubstraitHintsargument toSubstraitConsumer::resolve_table_ref. This lets each implementor decide how to incorporate the hints.DefaultSubstraitConsumerapplies them by wrapping the resolved provider with a privateStatisticsOverrideTableProviderwhen the provider is missing statistics, exposingnum_rowsasPrecision::Inexact(n)and reconstructingtotal_byte_size = effective_row_count * record_size. The effective row count used fortotal_byte_sizereconstruction is the provider's ownnum_rowswhen present (keeping the two statistics internally consistent), falling back to the hint'srow_countwhen the provider has none. Local provider statistics always take precedence over the hint.SubstraitHintsis a#[non_exhaustive]struct withrow_count: Option<f64>andrecord_size: Option<f64>, allowing new fields to be added in future versions without further breaking changes.Are these changes tested?
Seven new integration tests in
roundtrip_logical_plan.rs:producer_sets_stats_hints: assertsrow_countandrecord_sizeare correctly written toRelCommon.hint.statsby the producer.consumer_injects_row_count_hint: produces a plan with row count 42, consumes it against a provider with no statistics, and assertsPrecision::Inexact(42)(also verifies thatExacton the producer becomesInexactafter the round-trip).consumer_injects_record_size_hint: verifies bothnum_rowsandtotal_byte_sizeare reconstructed from hints.consumer_preserves_provider_statistics_over_hint: verifies that a provider with its own statistics is never overridden by the hint.consumer_injects_hint_into_absent_statistics: verifies hints are injected even when the provider returnsSome(Statistics { all Absent })rather thanNone.consumer_injects_byte_size_using_provider_row_count: verifies that when the provider already hasnum_rows, the injectedtotal_byte_sizeuses the provider's row count (not the hint's), keeping the two statistics consistent.consumer_skips_byte_size_when_row_count_hint_absent: verifies that arecord_size-only hint (norow_count) does not injecttotal_byte_size, even when the provider has its ownnum_rows.Are there any user-facing changes?
SubstraitConsumer::resolve_table_refgains a newhints: SubstraitHintsparameter — this is a breaking change for custom implementors of the trait. A migration guide with a before/after example is added to the54.0.0upgrade guide. The behavior is otherwise additive: Substrait plans produced by DataFusion now carry row count and record size hints, and plans consumed by DataFusion now surface those hints throughTableProvider::statistics()when no local statistics are present.Limitations
ReadRel).RelCommon.hint.statsis present on every Substrait relation, but this PR does not read or write it for intermediate nodes (joins, projections, filters, aggregations, etc.). This is left for a follow-up.Exactstatistics becomeInexactafter a round-trip, becauseRelCommon.hintis advisory by design.record_size-only hint (norow_count) cannot reconstructtotal_byte_sizeand is silently ignored byDefaultSubstraitConsumer. This can arise from non-DataFusion Substrait producers. DataFusion's own producer never writesrecord_sizewithoutrow_count.