feat(pruning): add StatisticsSource trait with two-phase resolve/evaluate API #21157
Draft
adriangb wants to merge 2 commits into apache:main from
…uate API

Introduces a new expression-based statistics API for pruning that separates async data resolution from sync predicate evaluation.

- StatisticsSource trait: accepts &[Expr], returns Vec<Option<ArrayRef>>
- ResolvedStatistics: HashMap<Expr, ArrayRef> cache for pre-resolved stats
- PruningPredicate::evaluate(): sync evaluation against pre-resolved cache
- PruningPredicate::all_required_expressions(): exposes needed Expr list
- Blanket impl bridges existing PruningStatistics implementations
- prune() refactored to delegate through resolve_all_sync + evaluate

This enables async statistics sources (external metastores, runtime sampling) while keeping the evaluation path synchronous for use in Stream::poll_next() contexts like EarlyStoppingStream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix broken intra-doc links for Expr, ResolvedStatistics, PruningPredicate
- Replace deprecated Expr::Wildcard with Expr::Literal in count expressions
- Fix clippy: collapsible if, bool_assert_comparison, uninlined_format_args, cloned_ref_to_slice_refs
- Fix unused variable warning in test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- StatisticsSource trait: an expression-based async statistics API that accepts &[Expr] and returns Vec<Option<ArrayRef>>
- ResolvedStatistics: a HashMap<Expr, ArrayRef> cache that separates async data resolution from sync predicate evaluation
- PruningPredicate::evaluate(): sync evaluation against pre-resolved stats cache
- Blanket impl bridges existing PruningStatistics implementations automatically
- Refactors prune() to delegate through resolve_all_sync() + evaluate(), validating the two-phase pattern end-to-end

Design
The core idea is a two-phase resolve/evaluate split:
- PruningPredicate::all_required_expressions() exposes what stats are needed as Vec<Expr>. The caller passes these to StatisticsSource::expression_statistics(), which returns arrays packaged into a ResolvedStatistics cache.
- PruningPredicate::evaluate(&ResolvedStatistics) looks up each required expression in the cache, null-fills missing entries (conservative: won't prune), builds a RecordBatch, and evaluates the predicate.

This keeps the evaluation path synchronous for Stream::poll_next() contexts like EarlyStoppingStream, while allowing the resolution step to be async.

Future work
Struct field pruning (#21003)
Because StatisticsSource accepts arbitrary Expr, a custom implementation can handle expressions like min(get_field(struct_col, 'field')) by resolving nested Parquet column statistics directly. The blanket impl on PruningStatistics returns None for these (it only handles flat Expr::Column args), but a Parquet-aware StatisticsSource impl can override this. No further API changes are needed: the expression language is already rich enough.

Async statistics sources
The async StatisticsSource trait enables use cases like querying an external metastore for statistics or sampling data at runtime. The two-phase pattern means callers resolve once (async) and evaluate many times (sync), which works well for dynamic filter scenarios where the predicate changes but the underlying data statistics don't.

Cardinality estimation
StatisticsSource could sit on ExecutionPlan nodes via a method like partition_expression_statistics(&[Expr]), delegating through DataSourceExec → FileScanConfig → FileSource → format-specific impl. This would enable queries like approx_count_distinct(col) for join optimization.

There is work in progress to add NDV statistics to Parquet, but this could unlock things like extracting stats from sampled data.
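
For reviewers who want a feel for the resolve-once, evaluate-many pattern described under Design, here is a minimal std-only sketch. It is not the code in this PR. ToySource, InMemory, Resolved, resolve and evaluate_gt are hypothetical names; strings stand in for Expr, Vec<Option<i64>> stands in for ArrayRef, and the resolve step is synchronous here, whereas the PR's StatisticsSource is async.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins: the PR keys on `Expr` and stores `ArrayRef`.
type StatExpr = String; // e.g. "max(a)"
type StatArray = Vec<Option<i64>>; // one entry per container (file, row group, ...)

/// Toy analogue of a statistics source (sync here; assumed async in the PR).
trait ToySource {
    fn num_containers(&self) -> usize;
    /// One result per requested expression; `None` means "cannot answer".
    fn expression_statistics(&self, exprs: &[StatExpr]) -> Vec<Option<StatArray>>;
}

struct InMemory {
    stats: HashMap<StatExpr, StatArray>,
    n: usize,
}

impl ToySource for InMemory {
    fn num_containers(&self) -> usize {
        self.n
    }
    fn expression_statistics(&self, exprs: &[StatExpr]) -> Vec<Option<StatArray>> {
        exprs.iter().map(|e| self.stats.get(e).cloned()).collect()
    }
}

/// Toy analogue of a pre-resolved statistics cache.
struct Resolved {
    n: usize,
    cache: HashMap<StatExpr, StatArray>,
}

/// Phase 1: ask the source once and keep only the expressions it could answer.
fn resolve(source: &dyn ToySource, required: &[StatExpr]) -> Resolved {
    let arrays = source.expression_statistics(required);
    let cache = required
        .iter()
        .cloned()
        .zip(arrays)
        .filter_map(|(expr, arr)| arr.map(|a| (expr, a)))
        .collect();
    Resolved { n: source.num_containers(), cache }
}

/// Phase 2: evaluate `a > threshold` per container using only the cache.
/// Returns `true` where the container must be kept.
fn evaluate_gt(resolved: &Resolved, threshold: i64) -> Vec<bool> {
    // A missing entry is null-filled: unknown statistics never prune.
    let nulls = vec![None; resolved.n];
    let max_a = resolved.cache.get("max(a)").unwrap_or(&nulls);
    max_a
        .iter()
        .map(|m| match m {
            Some(max) => *max > threshold, // can only match if max(a) > threshold
            None => true,                  // unknown: keep the container
        })
        .collect()
}

fn main() {
    let mut stats = HashMap::new();
    stats.insert("max(a)".to_string(), vec![Some(5), Some(20), None]);
    let source = InMemory { stats, n: 3 };

    // Resolve once...
    let resolved = resolve(&source, &["max(a)".to_string()]);
    // ...then evaluate as often as the (dynamic) predicate changes.
    println!("{:?}", evaluate_gt(&resolved, 10));
    println!("{:?}", evaluate_gt(&resolved, 30));
}
```

The second call reuses the same Resolved value with a different threshold, which is the dynamic-filter case: the predicate changes while the statistics stay fixed. A source that cannot answer max(a) yields an empty cache, and every container is kept.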
Test plan
- datafusion-datasource-parquet compiles unchanged

🤖 Generated with Claude Code