Optimization of COUNT on top of SEMI join to NDISTINCT on the RHS by john-sanchez31 · Pull Request #507 · bodo-ai/PyDough

john-sanchez31 · 2026-03-27T16:13:43Z

Resolves #491
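
For context, here is a minimal sketch of the equivalence named in the PR title, written with plain Python collections rather than PyDough's relational nodes. The helper names and the toy keys are made up for illustration. It assumes the left-hand key is unique and that every right-hand key matches some left-hand row, which is how the "reverse cardinality that always matches" condition quoted in the review reads.

```python
def count_over_semi_join(lhs_keys, rhs_keys):
    """COUNT(*) on top of a SEMI join: count LHS rows with at least one RHS match."""
    rhs = set(rhs_keys)
    return sum(1 for key in lhs_keys if key in rhs)


def ndistinct(rhs_keys):
    """NDISTINCT on the RHS join key; no join is needed."""
    return len(set(rhs_keys))


# Hypothetical data: four customers (unique keys) and three orders.
customer_keys = [1, 2, 3, 4]
order_customer_keys = [1, 1, 3]

# Every order key matches a customer, so both forms give the same answer.
assert count_over_semi_join(customer_keys, order_customer_keys) == ndistinct(order_customer_keys)

# An order key with no matching customer breaks the equivalence, which is
# why the rewrite has to check that the reverse side always matches.
orphaned = order_customer_keys + [99]
assert count_over_semi_join(customer_keys, orphaned) != ndistinct(orphaned)
```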

john-sanchez31 · 2026-04-01T19:54:29Z

pydough/relational/rel_util.py

+) -> RelationalNode:
+    """
+
+    This function optimize COUNT on top of SEMI join to NDISTINCT on the right


Fix docstring (include INNER join)

Reminder to do this

john-sanchez31 · 2026-04-01T20:56:34Z

pydough/relational/rel_util.py

+
+    assert isinstance(lhs_key, ColumnReference)
+
+    if agg_input.join_type in (JoinType.SEMI, JoinType.INNER):


Some refactoring can be done here

Is this a reminder or a followup?

It's a reminder

john-sanchez31 · 2026-04-01T21:07:14Z

tests/test_plan_refsols/redundant_has_on_plural.txt

@@ -1,6 +1,4 @@
-ROOT(columns=[('n', n_rows)], orderings=[])


Good example of the optimization

The comment for this and the next one should be updated in the test - should NOT optimize, stays SEMI

I think those comments are not related to this optimization. Should I update them anyway?

hadia206

Overall looks good.

I don't see changes to the tests mentioned in the issue: filter_count_15, filter_count_16, general_join_02, patient_claims and has_cross_correlated_singular.

Let’s add the test mentioned in the issue as well

# How many customers in the building market segment have made an urgent order in 1994?

selected_orders = orders.WHERE((order_priority == '1-URGENT') & (YEAR(order_date) == 1994))
result = TPCH.CALCULATE(
  n=COUNT(customers.WHERE((market_segment == 'BUILDING') & HAS(selected_orders)))
)

hadia206 · 2026-04-06T23:14:11Z

tests/test_plan_refsols/redundant_has_on_plural.txt

@@ -1,6 +1,4 @@
-ROOT(columns=[('n', n_rows)], orderings=[])


The comment for this and the next one should be updated in the test - should NOT optimize, stays SEMI

hadia206 · 2026-04-06T23:15:08Z

pydough/relational/rel_util.py

+) -> RelationalNode:
+    """
+
+    This function optimize COUNT on top of SEMI join to NDISTINCT on the right


Reminder to do this

hadia206 · 2026-04-06T23:15:29Z

pydough/relational/rel_util.py

+        - The join has a reverse cardinality that always matches.
+
+    Args:
+        `node`: The node being transformed


missing input_unique_sets

hadia206 · 2026-04-06T23:18:07Z

tests/test_pipeline_tpch_custom.py

+                "rewrite_count_sf_pa",
+            ),
+            id="rewrite_count_sf_pa",
+        ),


Need a test where an INNER join triggers the NDISTINCT rewrite

hadia206 · 2026-04-06T23:19:15Z

pydough/relational/rel_util.py

+    else:
+        return node
+
+    assert isinstance(lhs_key, ColumnReference)


Is this needed? The true branch guarantees its assignment, and the false branch returns before reaching here.

It was added because pre-commit complained about the types
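
A minimal sketch of one common reason a type checker asks for such an assert. The two classes below are simplified stand-ins defined only for this example, not PyDough's real `ColumnReference`, and the actual complaint in the PR may have been different. When a variable's static type is a union, or a broader base class, an `isinstance` assert narrows it so that attribute access type-checks.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ColumnReference:  # hypothetical stand-in for the real class
    name: str


@dataclass
class LiteralExpression:  # hypothetical stand-in for another expression type
    value: object


Expression = Union[ColumnReference, LiteralExpression]


def key_name(expr: Expression) -> str:
    # Without this assert, a static checker reports that
    # LiteralExpression has no attribute "name".
    assert isinstance(expr, ColumnReference)
    return expr.name


print(key_name(ColumnReference("c_custkey")))
```

Because `assert` statements are skipped when Python runs with `-O`, an explicit `if not isinstance(...): return node` is an alternative that narrows the type the same way and still runs in optimized mode.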

hadia206 · 2026-04-06T23:19:54Z

pydough/relational/rel_util.py

+
+    assert isinstance(lhs_key, ColumnReference)
+
+    if agg_input.join_type in (JoinType.SEMI, JoinType.INNER):


Is this a reminder or a followup?

hadia206 · 2026-04-06T23:21:30Z

pydough/relational/relational_nodes/column_pruner.py

            # be present in the output.
            required_columns = set(node.keys.keys())
+        elif isinstance(node, Join) and self._keep_condition_columns:
+            # For join this avoids prunning columns required for


typo

Suggested change

-            # For join this avoids prunning columns required for
+            # For join this avoids pruning columns required for

knassre-bodo

Well done John! A few things to potentially address, but this is close to being done :)

knassre-bodo · 2026-04-08T17:50:49Z

pydough/relational/rel_util.py

+    # Aggregate must contain exactly one aggregation: COUNT(*)
+    if len(node.aggregations) != 1:
+        return node
+
+    ((agg_key, agg_value),) = node.aggregations.items()
+    if agg_value.op != pydop.COUNT or agg_value.inputs:
+        return node


This can potentially be a followup, but there is a way to extend this beyond just COUNT(*). We can allow multiple aggregations, as long as ALL of them obey the following rules:

If it is not COUNT(*), then it can only reference columns from the RHS

If it is not COUNT(*), then it has to be one of the functions that is not affected by having its rows duplicated by the join: MIN, MAX, ANYTHING, NDISTINCT

So we'd run the optimization as long as there is at least one aggregation and ALL of the aggregations meet those criteria. COUNT(*), if present, would still get transformed to NDISTINCT, but the others are more straightforward transformations (MIN(expr) becomes MIN(add_input_name(join.columns[expr.name], None)))
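
A small sketch of why those particular functions are safe, using plain Python lists instead of relational nodes. The helper `inner_join_values` and the data are invented for this example, and it assumes every RHS row has at least one LHS match. When the join repeats an RHS row, MIN, MAX and NDISTINCT over RHS columns are unchanged, while COUNT and SUM are not.

```python
def inner_join_values(lhs_keys, rhs_rows):
    """Return the RHS value once per matching (lhs, rhs) pair, like an INNER join."""
    return [value for lk in lhs_keys for rk, value in rhs_rows if lk == rk]


rhs_rows = [(1, 10), (2, 30)]  # (join key, value)
lhs_keys = [1, 1, 2]           # key 1 appears twice, so its RHS row is duplicated

joined = inner_join_values(lhs_keys, rhs_rows)  # values after the join
direct = [value for _, value in rhs_rows]       # values read straight from the RHS

# Unaffected by duplication: can be computed on the RHS alone.
assert min(joined) == min(direct)
assert max(joined) == max(direct)
assert len(set(joined)) == len(set(direct))  # NDISTINCT

# Affected by duplication: not safe to move below the join.
assert len(joined) != len(direct)  # COUNT
assert sum(joined) != sum(direct)  # SUM
```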

knassre-bodo · 2026-04-08T17:53:40Z

pydough/relational/relational_nodes/column_pruner.py

        self._correl_dispatcher = RelationalExpressionDispatcher(
            self._correl_finder, recurse=False
        )
+        self._keep_condition_columns = False


Suggested change

-        self._keep_condition_columns = False
+        self._keep_condition_columns = False
+        """
+        A boolean toggle indicating whether to maintain the columns used in the
+        condition of a Join node in the output of the Join node even if they are
+        unused by the Join node's parent node. If False, the columns in the condition
+        will not be maintained in the Join node's columns unless they need to be.
+        """

knassre-bodo · 2026-04-08T17:57:49Z

pydough/conversion/relational_converter.py

    # side of the join are the join keys. This will make some joins redundant
    # and allow them to be deleted later. Then, re-run column pruning.
    root = confirm_root(join_key_substitution(root))
-    root = pruner.prune_unused_columns(root)


A thought: we may be able to affect more tests with this optimization (particularly the ones from the MASKED tables) if we do an additional round of root = remove_redundant_aggs(root) around here (after the bubbling, but BEFORE pullup projections).

The reason is that the pullup step embeds function calls inside JOIN conditions, which will block your optimization from occurring. If we run it again before pullup happens, it may trigger more often.
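
A toy illustration of the ordering concern, not the real PyDough passes. `ToyJoin`, `toy_rewrite`, `toy_pullup` and the function name `F` are all hypothetical. It assumes only that the rewrite matches plain column-equality conditions and that pullup wraps one side of the condition in a function call.

```python
from dataclasses import dataclass


@dataclass
class ToyJoin:
    condition: str           # e.g. "t0.key == t1.key"
    rewritten: bool = False  # set once the toy rewrite has fired


def toy_rewrite(join: ToyJoin) -> ToyJoin:
    """Stand-in for the rewrite: fires only on a plain column equality."""
    if "(" not in join.condition:
        return ToyJoin(join.condition, rewritten=True)
    return join


def toy_pullup(join: ToyJoin) -> ToyJoin:
    """Stand-in for projection pullup: embeds a function call in the condition."""
    lhs, rhs = join.condition.split(" == ")
    return ToyJoin(f"{lhs} == F({rhs})", join.rewritten)


join = ToyJoin("t0.key == t1.key")

# Pullup first: the condition now contains a call, so the rewrite is blocked.
assert toy_rewrite(toy_pullup(join)).rewritten is False

# Rewrite first: the pattern still matches, and pullup runs afterwards.
assert toy_pullup(toy_rewrite(join)).rewritten is True
```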

knassre-bodo · 2026-04-08T17:58:13Z

pydough/conversion/agg_removal.py

+
+            if isinstance(node, Aggregate):
+                node = rewrite_count_ndistinct(node, input_uniqueness)


Add a comment here

john-sanchez31 added 7 commits March 27, 2026 10:11

base implementation

d9a5a7f

adding optimization updates

7117757

adding test refsol

5d4f580

adding keep_condition_columns

1c9d9c4

inner join case, all tests [run ci][run dialects]

6d6c05c

validation moved, refsol updated [run ci][run dialects]

feb32e2

INNER implementation [run ci][run dialects]

3338057

john-sanchez31 commented Apr 1, 2026

john-sanchez31 marked this pull request as ready for review April 1, 2026 21:13

john-sanchez31 requested review from a team, hadia206, juankx-bodo and knassre-bodo and removed request for a team April 1, 2026 21:13

john-sanchez31 changed the title ~~Optimization of COUNT on to of SEMI join to NDISTINCT on the RHS~~ Optimization of COUNT on top of SEMI join to NDISTINCT on the RHS Apr 6, 2026

hadia206 reviewed Apr 7, 2026

cleaning up, fixing comments

c86e173

knassre-bodo reviewed Apr 8, 2026

adding comments

c4246d9


3 participants