Skip to content

feat[substrait]: translate scan statistics (row_count, record_size) via RelCommon hints#21112

Open
etiennepelissier wants to merge 1 commit intoapache:mainfrom
etiennepelissier:implement-statistics-support-for-substrait
Open

feat[substrait]: translate scan statistics (row_count, record_size) via RelCommon hints#21112
etiennepelissier wants to merge 1 commit intoapache:mainfrom
etiennepelissier:implement-statistics-support-for-substrait

Conversation

@etiennepelissier
Copy link

@etiennepelissier etiennepelissier commented Mar 23, 2026

Which issue does this PR close?

Partially closes #8698. This PR implements row_count and record_size translation for scan nodes (ReadRel). Statistics for intermediate nodes are out of scope and left for a follow-up.

Rationale for this change

Substrait's RelCommon.hint.stats carries row count and record size statistics as advisory hints. DataFusion was not reading or writing these fields, meaning statistics were silently dropped when round-tripping logical plans through Substrait. Preserving them is useful for downstream optimizer rules that rely on row count estimates (e.g. join ordering).

What changes are included in this PR?

Producer (producer/rel/read_rel.rs): when serializing a TableScan, reads statistics() from the TableProvider. If num_rows is Exact(n) or Inexact(n), it is written to RelCommon.hint.stats.row_count. If total_byte_size is also available, record_size = total_byte_size / row_count is written to RelCommon.hint.stats.record_size.

Consumer (substrait_consumer.rs + consumer/rel/read_rel.rs): extracts row_count and record_size from RelCommon.hint.stats on any ReadRel and passes them via a new hints: SubstraitHints argument to SubstraitConsumer::resolve_table_ref. This lets each implementor decide how to incorporate the hints. DefaultSubstraitConsumer applies them by wrapping the resolved provider with a private StatisticsOverrideTableProvider when the provider is missing statistics, exposing num_rows as Precision::Inexact(n) and reconstructing total_byte_size = effective_row_count * record_size. The effective row count used for total_byte_size reconstruction is the provider's own num_rows when present (keeping the two statistics internally consistent), falling back to the hint's row_count when the provider has none. Local provider statistics always take precedence over the hint.

SubstraitHints is a #[non_exhaustive] struct with row_count: Option<f64> and record_size: Option<f64>, allowing new fields to be added in future versions without further breaking changes.

Are these changes tested?

Seven new integration tests in roundtrip_logical_plan.rs:

  • producer_sets_stats_hints: asserts row_count and record_size are correctly written to RelCommon.hint.stats by the producer.
  • consumer_injects_row_count_hint: produces a plan with row count 42, consumes it against a provider with no statistics, and asserts Precision::Inexact(42) (also verifies that Exact on the producer becomes Inexact after the round-trip).
  • consumer_injects_record_size_hint: verifies both num_rows and total_byte_size are reconstructed from hints.
  • consumer_preserves_provider_statistics_over_hint: verifies that a provider with its own statistics is never overridden by the hint.
  • consumer_injects_hint_into_absent_statistics: verifies hints are injected even when the provider returns Some(Statistics { all Absent }) rather than None.
  • consumer_injects_byte_size_using_provider_row_count: verifies that when the provider already has num_rows, the injected total_byte_size uses the provider's row count (not the hint's), keeping the two statistics consistent.
  • consumer_skips_byte_size_when_row_count_hint_absent: verifies that a record_size-only hint (no row_count) does not inject total_byte_size, even when the provider has its own num_rows.

Are there any user-facing changes?

SubstraitConsumer::resolve_table_ref gains a new hints: SubstraitHints parameter — this is a breaking change for custom implementors of the trait. A migration guide with a before/after example is added to the 54.0.0 upgrade guide. The behavior is otherwise additive: Substrait plans produced by DataFusion now carry row count and record size hints, and plans consumed by DataFusion now surface those hints through TableProvider::statistics() when no local statistics are present.

Limitations

  • Statistics are only propagated for scan nodes (ReadRel). RelCommon.hint.stats is present on every Substrait relation, but this PR does not read or write it for intermediate nodes (joins, projections, filters, aggregations, etc.). This is left for a follow-up.
  • Exact statistics become Inexact after a round-trip, because RelCommon.hint is advisory by design.
  • Zero-row / zero-byte values are silently dropped due to proto3 default-zero ambiguity.
  • A record_size-only hint (no row_count) cannot reconstruct total_byte_size and is silently ignored by DefaultSubstraitConsumer. This can arise from non-DataFusion Substrait producers. DataFusion's own producer never writes record_size without row_count.

@github-actions github-actions bot added the substrait Changes to the substrait crate label Mar 23, 2026
Copy link
Contributor

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool to see stats provided as hints in the Substrait plan!

Comment on lines +45 to +49
#[derive(Debug)]
struct StatisticsOverrideTableProvider {
inner: Arc<dyn TableProvider>,
statistics: Statistics,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An "override" or a "wrapper" is probably not the most future proof way of enhancing a node with statistics, it's a bit hacky for it to be committed here.

For example, this can mess with the downcast(), as the specific struct implementation of the TableProvider is now a different one.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the updated version. StatisticsOverrideTableProvider is still used by DefaultSubstraitConsumer as a fallback, but since the hint is now passed through resolve_table_ref, custom implementors can return a provider with statistics already embedded and avoid the wrapper entirely — which also sidesteps the downcast issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the impression that a less opinionated approach would be to pass down the hints as another argument to consumer.resolve_table_ref(, and let implementors of resolve_table_ref decide wether they want to use the provided statistics or not in their specific TableProvider implementation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented as suggested. resolve_table_ref now takes a row_count_hint: Option argument, letting each implementor decide what to do with it. The default consumer applies the wrapper when the provider has no statistics of its own; custom consumers can handle it however fits their TableProvider implementation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The final API evolved further: the parameter is now hints: SubstraitHints (a #[non_exhaustive] struct with row_count: Option<f64> and record_size: Option<f64>) rather than the intermediate row_count_hint: Option<f64>. The struct groups both hints together and allows new fields to be added in future versions without additional breaking changes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To close the loop: the final API has resolve_table_ref taking hints: SubstraitHints (a #[non_exhaustive] struct with row_count: Option<f64> and record_size: Option<f64>), rather than the intermediate row_count_hint: Option<f64> mentioned in my earlier reply. StatisticsOverrideTableProvider is private to DefaultSubstraitConsumer — custom implementors can ignore the hints or embed them directly into their own TableProvider. Happy to walk through anything before you re-review.

@etiennepelissier etiennepelissier force-pushed the implement-statistics-support-for-substrait branch 3 times, most recently from e204e72 to b246002 Compare March 25, 2026 12:47
@etiennepelissier etiennepelissier changed the title feat[substrait]: translate row_count statistics via RelCommon hint feat[substrait]: translate scan statistics (row_count, record_size) via RelCommon hints Mar 25, 2026
@etiennepelissier etiennepelissier force-pushed the implement-statistics-support-for-substrait branch 3 times, most recently from a09ed90 to f665d1e Compare March 25, 2026 13:27
etiennepelissier added a commit to etiennepelissier/datafusion that referenced this pull request Mar 25, 2026
Serialize `num_rows` and `total_byte_size` from a `TableProvider` into
`RelCommon.hint.stats` on the producer side, and inject them back into
the resolved provider on the consumer side when the provider has no
statistics of its own.

- Producer: writes `row_count` and `record_size` (= bytes / rows) into
  `RelCommon.hint.stats` for every `TableScan`.
- Consumer: exposes a new `hints: SubstraitHints` parameter on
  `SubstraitConsumer::resolve_table_ref`; `DefaultSubstraitConsumer`
  wraps the provider in a `StatisticsOverrideTableProvider` when hints
  are present and the provider lacks the corresponding fields.
- `SubstraitHints` is `#[non_exhaustive]` so future fields can be added
  without further breaking changes.
- Hints are advisory: `Exact` statistics become `Inexact` after a
  round-trip.  Tables with exactly 0 rows lose their stats (proto3
  default-zero ambiguity — accepted limitation).

Closes apache#21112

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@etiennepelissier etiennepelissier force-pushed the implement-statistics-support-for-substrait branch 2 times, most recently from 07736eb to 020fd13 Compare March 25, 2026 14:01
Serialize `num_rows` and `total_byte_size` from a `TableProvider` into
`RelCommon.hint.stats` on the producer side, and inject them back into
the resolved provider on the consumer side when the provider has no
statistics of its own.

- Producer: writes `row_count` and `record_size` (= bytes / rows) into
  `RelCommon.hint.stats` for every `TableScan`.
- Consumer: exposes a new `hints: SubstraitHints` parameter on
  `SubstraitConsumer::resolve_table_ref`; `DefaultSubstraitConsumer`
  wraps the provider in a `StatisticsOverrideTableProvider` when hints
  are present and the provider lacks the corresponding fields.
- `SubstraitHints` is `#[non_exhaustive]` so future fields can be added
  without further breaking changes.
- Hints are advisory: `Exact` statistics become `Inexact` after a
  round-trip.  Tables with exactly 0 rows lose their stats (proto3
  default-zero ambiguity — accepted limitation).
- A `record_size`-only hint (no `row_count`) cannot reconstruct
  `total_byte_size` and is silently ignored; this is tested explicitly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@etiennepelissier etiennepelissier force-pushed the implement-statistics-support-for-substrait branch from 020fd13 to 5933e38 Compare March 25, 2026 14:03
@etiennepelissier etiennepelissier marked this pull request as ready for review March 25, 2026 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement statistics support for Substrait

2 participants