Optimization of COUNT on top of SEMI join to NDISTINCT on the RHS#507
Optimization of COUNT on top of SEMI join to NDISTINCT on the RHS#507john-sanchez31 wants to merge 9 commits intomainfrom
Conversation
pydough/relational/rel_util.py
Outdated
| ) -> RelationalNode: | ||
| """ | ||
|
|
||
| This function optimize COUNT on top of SEMI join to NDISTINCT on the right |
There was a problem hiding this comment.
Fix docstring (include INNER join)
pydough/relational/rel_util.py
Outdated
|
|
||
| assert isinstance(lhs_key, ColumnReference) | ||
|
|
||
| if agg_input.join_type in (JoinType.SEMI, JoinType.INNER): |
There was a problem hiding this comment.
Some refactorization can be done here
There was a problem hiding this comment.
Is this a reminder or a followup?
There was a problem hiding this comment.
It's a reminder
| @@ -1,6 +1,4 @@ | |||
| ROOT(columns=[('n', n_rows)], orderings=[]) | |||
There was a problem hiding this comment.
Good example of the optimization
There was a problem hiding this comment.
The comment for this and the next one should be updated in the test - should NOT optimize, stays SEMI
There was a problem hiding this comment.
I think those comments are not related to this optimization, should I update those anyways?
hadia206
left a comment
There was a problem hiding this comment.
Overall looks good.
I don’t see change to these tests that are mentioned in the issue filter_count_15, filter_count_16, general_join_02 , patient_claims and has_cross_correlated_singular
Let’s add the test mentioned in the issue as well
# How many customers in the building market segment have made an urgent order in 1994?
selected_orders = orders.WHERE((order_priority == '1-URGENT') & (YEAR(order_date) == 1994))
result = TPCH.CALCULATE(
n=COUNT(customers.WHERE((market_segment == 'BUILDING') & HAS(selected_orders))
)
| @@ -1,6 +1,4 @@ | |||
| ROOT(columns=[('n', n_rows)], orderings=[]) | |||
There was a problem hiding this comment.
The comment for this and the next one should be updated in the test - should NOT optimize, stays SEMI
pydough/relational/rel_util.py
Outdated
| ) -> RelationalNode: | ||
| """ | ||
|
|
||
| This function optimize COUNT on top of SEMI join to NDISTINCT on the right |
| - The join has a reverse cardinality that always matches. | ||
|
|
||
| Args: | ||
| `node`: The node being transformed |
| "rewrite_count_sf_pa", | ||
| ), | ||
| id="rewrite_count_sf_pa", | ||
| ), |
There was a problem hiding this comment.
Need a test for where an INNER join that triggers the NDISTINCT rewrite
| else: | ||
| return node | ||
|
|
||
| assert isinstance(lhs_key, ColumnReference) |
There was a problem hiding this comment.
Is this needed? If True body statement guarantees its assignment and the false will return and not reach here
There was a problem hiding this comment.
Was added because of pre-commit complaining about the types
pydough/relational/rel_util.py
Outdated
|
|
||
| assert isinstance(lhs_key, ColumnReference) | ||
|
|
||
| if agg_input.join_type in (JoinType.SEMI, JoinType.INNER): |
There was a problem hiding this comment.
Is this a reminder or a followup?
| # be present in the output. | ||
| required_columns = set(node.keys.keys()) | ||
| elif isinstance(node, Join) and self._keep_condition_columns: | ||
| # For join this avoids prunning columns required for |
There was a problem hiding this comment.
typo
| # For join this avoids prunning columns required for | |
| # For join this avoids pruning columns required for |
knassre-bodo
left a comment
There was a problem hiding this comment.
Well done John! A few things to potentially address, but this is close to being done :)
| # Aggregate must contain exactly one aggregation: COUNT(*) | ||
| if len(node.aggregations) != 1: | ||
| return node | ||
|
|
||
| ((agg_key, agg_value),) = node.aggregations.items() | ||
| if agg_value.op != pydop.COUNT or agg_value.inputs: | ||
| return node |
There was a problem hiding this comment.
This can potentially be a followup, but there is a way to extend this beyond just COUNT(*). We can allow multiple aggregations, as long as ALL of them obey the following rules:
- If it is not
COUNT(*), then it can only reference columns from the RHS - If it is not
COUNT(*), then it has to be one of the functions that is not affected by having its rows duplicated by the join:MIN,MAX,ANYTHING,NDISTINCT
So we'd run the optimization as long there is at least 1 aggregation, ALL of the aggregations meet those criteria. COUNT(*) if present would still get transformed to NDISTINCT, but the others are more straightforward transformations (MIN(expr) becomes MIN(add_input_name(join.columns[expr.name], None)))
| self._correl_dispatcher = RelationalExpressionDispatcher( | ||
| self._correl_finder, recurse=False | ||
| ) | ||
| self._keep_condition_columns = False |
There was a problem hiding this comment.
| self._keep_condition_columns = False | |
| self._keep_condition_columns = False | |
| """ | |
| A boolean toggle indicating whether to maintain the columns used in the | |
| condition of a Join node in the output of the Join node even if they are | |
| unused by the Join node's parent node. If False, the columns in the condition | |
| will not be maintained in the Join node's columns unless they need to be. | |
| """ |
| # side of the join are the join keys. This will make some joins redundant | ||
| # and allow them to be deleted later. Then, re-run column pruning. | ||
| root = confirm_root(join_key_substitution(root)) | ||
| root = pruner.prune_unused_columns(root) |
There was a problem hiding this comment.
A thought: we may be able to affect more tests with this optimization (particularly the ones from the MASKED tables) if we do an additional round of root = remove_redundant_aggs(root) around here (after the bubbling, but BEFORE pullup projections).
The reason is that the pullup step embeds function calls inside JOIN conditions, which will block your optimization from ocurring. If we run it again before pullup happens, it may trigger more often.
|
|
||
| if isinstance(node, Aggregate): | ||
| node = rewrite_count_ndistinct(node, input_uniqueness) |
Resolves #491