Skip to content

Optimization of COUNT on top of SEMI join to NDISTINCT on the RHS#507

Open
john-sanchez31 wants to merge 9 commits intomainfrom
John/count_opt
Open

Optimization of COUNT on top of SEMI join to NDISTINCT on the RHS#507
john-sanchez31 wants to merge 9 commits intomainfrom
John/count_opt

Conversation

@john-sanchez31
Copy link
Copy Markdown
Contributor

Resolves #491

) -> RelationalNode:
"""

This function optimize COUNT on top of SEMI join to NDISTINCT on the right
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix docstring (include INNER join)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to do this


assert isinstance(lhs_key, ColumnReference)

if agg_input.join_type in (JoinType.SEMI, JoinType.INNER):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some refactorization can be done here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a reminder or a followup?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a reminder

@@ -1,6 +1,4 @@
ROOT(columns=[('n', n_rows)], orderings=[])
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good example of the optimization

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for this and the next one should be updated in the test - should NOT optimize, stays SEMI

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think those comments are not related to this optimization, should I update those anyways?

@john-sanchez31 john-sanchez31 marked this pull request as ready for review April 1, 2026 21:13
@john-sanchez31 john-sanchez31 requested review from a team, hadia206, juankx-bodo and knassre-bodo and removed request for a team April 1, 2026 21:13
@john-sanchez31 john-sanchez31 changed the title Optimization of COUNT on to of SEMI join to NDISTINCT on the RHS Optimization of COUNT on top of SEMI join to NDISTINCT on the RHS Apr 6, 2026
Copy link
Copy Markdown
Contributor

@hadia206 hadia206 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good.

I don’t see change to these tests that are mentioned in the issue filter_count_15, filter_count_16, general_join_02 , patient_claims and has_cross_correlated_singular

Let’s add the test mentioned in the issue as well

# How many customers in the building market segment have made an urgent order in 1994?

selected_orders = orders.WHERE((order_priority == '1-URGENT') & (YEAR(order_date) == 1994))
result = TPCH.CALCULATE(
  n=COUNT(customers.WHERE((market_segment == 'BUILDING') & HAS(selected_orders))
)

@@ -1,6 +1,4 @@
ROOT(columns=[('n', n_rows)], orderings=[])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for this and the next one should be updated in the test - should NOT optimize, stays SEMI

) -> RelationalNode:
"""

This function optimize COUNT on top of SEMI join to NDISTINCT on the right
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to do this

- The join has a reverse cardinality that always matches.

Args:
`node`: The node being transformed
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing input_unique_sets

"rewrite_count_sf_pa",
),
id="rewrite_count_sf_pa",
),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a test for where an INNER join that triggers the NDISTINCT rewrite

else:
return node

assert isinstance(lhs_key, ColumnReference)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? If True body statement guarantees its assignment and the false will return and not reach here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was added because of pre-commit complaining about the types


assert isinstance(lhs_key, ColumnReference)

if agg_input.join_type in (JoinType.SEMI, JoinType.INNER):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a reminder or a followup?

# be present in the output.
required_columns = set(node.keys.keys())
elif isinstance(node, Join) and self._keep_condition_columns:
# For join this avoids prunning columns required for
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

Suggested change
# For join this avoids prunning columns required for
# For join this avoids pruning columns required for

Copy link
Copy Markdown
Contributor

@knassre-bodo knassre-bodo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done John! A few things to potentially address, but this is close to being done :)

Comment on lines +856 to +862
# Aggregate must contain exactly one aggregation: COUNT(*)
if len(node.aggregations) != 1:
return node

((agg_key, agg_value),) = node.aggregations.items()
if agg_value.op != pydop.COUNT or agg_value.inputs:
return node
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can potentially be a followup, but there is a way to extend this beyond just COUNT(*). We can allow multiple aggregations, as long as ALL of them obey the following rules:

  • If it is not COUNT(*), then it can only reference columns from the RHS
  • If it is not COUNT(*), then it has to be one of the functions that is not affected by having its rows duplicated by the join: MIN, MAX, ANYTHING, NDISTINCT

So we'd run the optimization as long there is at least 1 aggregation, ALL of the aggregations meet those criteria. COUNT(*) if present would still get transformed to NDISTINCT, but the others are more straightforward transformations (MIN(expr) becomes MIN(add_input_name(join.columns[expr.name], None)))

self._correl_dispatcher = RelationalExpressionDispatcher(
self._correl_finder, recurse=False
)
self._keep_condition_columns = False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._keep_condition_columns = False
self._keep_condition_columns = False
"""
A boolean toggle indicating whether to maintain the columns used in the
condition of a Join node in the output of the Join node even if they are
unused by the Join node's parent node. If False, the columns in the condition
will not be maintained in the Join node's columns unless they need to be.
"""

# side of the join are the join keys. This will make some joins redundant
# and allow them to be deleted later. Then, re-run column pruning.
root = confirm_root(join_key_substitution(root))
root = pruner.prune_unused_columns(root)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thought: we may be able to affect more tests with this optimization (particularly the ones from the MASKED tables) if we do an additional round of root = remove_redundant_aggs(root) around here (after the bubbling, but BEFORE pullup projections).

The reason is that the pullup step embeds function calls inside JOIN conditions, which will block your optimization from ocurring. If we run it again before pullup happens, it may trigger more often.

Comment on lines +266 to +268

if isinstance(node, Aggregate):
node = rewrite_count_ndistinct(node, input_uniqueness)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize COUNT on top of SEMI join to NDISTINCT on the right hand side

3 participants