Skip to content

[SQL-2712] Update IN translation to use $in#144

Open
jpowell-mongo wants to merge 27 commits into
mongodb:mainfrom
jpowell-mongo:jp/SQL-2712-in-operator
Open

[SQL-2712] Update IN translation to use $in#144
jpowell-mongo wants to merge 27 commits into
mongodb:mainfrom
jpowell-mongo:jp/SQL-2712-in-operator

Conversation

@jpowell-mongo
Copy link
Copy Markdown
Collaborator

@jpowell-mongo jpowell-mongo commented Apr 28, 2026

Summary

Note: Tests contribute to most of the line count, implementation code is much smaller.

This change updates mongosql to preserve the IN operator in SQL queries rather than translate them into a disjunction of ORs. Index Utilization is preserved.

Before, our rewriters would translate the query to a disjunction of ORs:

-- original query
SELECT a IN (b, c, d)
-- query after rewrites
SELECT a = b OR a = c OR a = d

Now, the code will preserve the IN operator, and later translate the code to the corresponding mql operator:

-- original query
SELECT a IN (b, c, d)
-- query after rewrites
SELECT a IN (b, c, d)

Along with these changes, I've made changes throughout the rest of the code, to allow IN / NOT IN to be translated to $in and { "$not": { "$in": [...] } } expressions in match stage aggregation pipelines, and use $in and $nin in $expr expressions.

./target/debug/mongosql-cli "SELECT * FROM nullable_fields WHERE a IN (100, 250, 322)"

Will produce this aggregation pipeline:

[
    { "$match": { "a": { "$in": [100, 250, 322] } } },
    { "$project": { "nullable_fields": "$$ROOT", "_id": 0 } },
]

Changes

  • Introduce In and NotIn enums to mir, air, ast, and codegen codepaths
  • Introduce a Tuple enum type throughut mir, air, and codegen. It operates the same as the Array type but I felt like it made it clearer to understand the cases.
  • Remove the InTupleRewriteVisitorPass, update the SingleTupleRewriter with a special case to not unwrap single element IN expressions (ex: SELECT * FROM orders where status IN ('pending'))
  • General threading of the new In, NotIn, and Tuple logic through the codebase

Testing

  • Unit Tests added at each step of the compiler, mir, air, codegen.
  • Index Utilization test to validate index usage still works.
  • Validated that the SQL from related Looker ticket works as expected.
  • Manual testing to validate $in expressions are used.

Comment thread mongosql/src/mir/definitions.rs
Comment thread mongosql-cli/src/main.rs
@jpowell-mongo jpowell-mongo marked this pull request as ready for review May 7, 2026 21:20
@jpowell-mongo jpowell-mongo requested a review from a team as a code owner May 7, 2026 21:20
@jpowell-mongo jpowell-mongo changed the title [DRAFT] [WORK IN PROGRESS] Update IN translation to use $in [SQL-2712] Update IN translation to use $in May 7, 2026
Copy link
Copy Markdown
Collaborator

@mattChiaravalloti mattChiaravalloti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it through everything except translator and codegen. I'm getting nervous I'll lose these big comments at this point so I'm submitting this partial review now. I'll follow up with thoughts on translator and codegen. This is a very large change! Nice job getting to all of the areas that need to be addressed. There is more work to do to get this generally correct but we're on a good path here.

current_db: index_usage_filter
query: "SELECT * FROM nullable_fields WHERE a > 100 AND b IN (SELECT b FROM non_nullable_fields WHERE a < 300)"
expected_utilization: IX_SCAN
- description: in operator uses the index
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- description: in operator uses the index
- description: in operator uses the index

[nit] All other tests in this file have a newline between them, so I suggest adding one here as well

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Questions/Thoughts] Should we also add any of the following tests?

  • NOT IN for nullable_fields should use an IX_SCAN
  • IN for non_nullable_fields should use an IX_SCAN I think... you can check that against mongodb manually and then add the test if you observe index utilization is possible in this case (I suspect it is)
  • NOT IN for non_nullable_fields should use an IX_SCAN (again, check this and if it does then we should add the test so we don't regress on this support in the future)

Comment thread mongosql/src/ast/pretty_print.rs
let op = BinaryOp::arbitrary(g);
// The parser always produces a Tuple on the RHS of IN/NOT IN, so arbitrary
// generation must match that shape to satisfy the round-trip property.
let right = if op == BinaryOp::In || op == BinaryOp::NotIn {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Praise] Smart!


#[cfg(test)]
mod test;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question] Is there a reason you got rid of this newline?

Comment thread mongosql/src/algebrizer/definitions.rs Outdated
}),
);

test_algebrize!(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question] Is this the same test as the one from binary.rs? Does it need to live in both places? I think the test file names correspond to input types, not output types, so this test is better placed in binary.rs. We can likely remove this here.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question still stands. My suggestion is that we remove this test; this PR should not need to make changes to this file.

Comment thread mongosql/src/mir/optimizer/match_null_filtering/mod.rs
Comment on lines +124 to +128
/// Rewrites `IN` and `NOT IN` scalar functions to native match language when the
/// left-hand side is a field access and every value in the list is a literal.
///
/// Any other shape (subqueries, field refs in the list, functions) falls through
/// to `None` so the expression stays in `ExprLanguage`.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Praise] Nice! Good thinking here, and the tests for this look great too.

Comment thread mongosql/src/air/definitions.rs Outdated
Copy link
Copy Markdown
Collaborator

@mattChiaravalloti mattChiaravalloti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest of my thoughts -- this time focused on translator and codegen (and, implicitly, air::desugarer).

};

let expression = match in_op.input {
None => return Err(Error::InvalidMatchLanguageInputRef),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Why would this ever be None? Can we update mir::MatchLanguageIn.input to be FieldPath instead of an Option<FieldPath>? Looking at the optimization that creates mir::MatchLanguageIn, there doesn't seem to be a need to make it an option since it is always set to Some(field_path).

Comment on lines +528 to +529
pub struct
FieldRef {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub struct
FieldRef {
pub struct FieldRef {

[nit/suggestion/question] Why the newline?

// $nin is match-language-only and rejected by the server in $project; NotIn must
// emit { "$not": [{ "$in": [...] }] } via SqlSemanticOperator codegen.
In => ScalarFunctionType::Sql(SqlOperator::In),
NotIn => ScalarFunctionType::Sql(SqlOperator::NotIn),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Thought] (blocking) I'm thinking through how algrebrization, translation, and codegen will work for these operators. It seems that we may still have additional work to do in the algebrization step which will impact these other steps.

In particular, we need to determine if the algebrized expression is nullable -- right now, we're doing that the way every other binary op does it but that's wrong! Since IN/NOT IN deal with an expr and an array of exprs, we determine nullability by either the LHS being nullable or any element of the RHS being nullable, right? nullable_a IN (1, 2, 3) is possibly null, non_nullable_a IN (1, 2, NULL, 3) is possibly null, non_nullable_a IN (1, 2, 3) is never null, and nullable_a IN (1, 2, NULL, 3) is possibly null. We'll need to update algebrization to account for this. Then, when we're doing translation here, we'll follow the correct path to translate these as SQL semantic or MQL semantic operators (if the scalar function is not nullable, we can translate to the MQL version, if it is nullable we need to translate to the SQL version).

On top of that, we may need to deal with the SQL version in the desugarer. Specifically, a IN (1, 2, NULL, 3) should be true if a is 1, 2, or 3, or it should be null if it itself is null or if it is not 1, 2, or 3 (because the result of the third comparison is null). However, as currently implemented we will just codegen {$in: ["$a", [1, 2, null, 3]]} and not get the expected null result when we should. The desugarer could handle this as part of the sql_null_semantics_operators pass.

Ultimately, the reason I'm commenting this here is because I think we'll need to update this translation util function to return MqlOperator::In/MqlOperator::NotIn in the even the function itself is not nullable.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the algebrizer and parsing logic here:
6d16cde

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make the desugarer change in another PR

);
}

mod in_operator {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Chore] We'll need to add tests for when the functions are nullable (is_nullable: true) -- these tests only handle when it that value is false.

translator::{self, test::match_query::mir_field_input},
};

fn make_translator() -> translator::MqlTranslator {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question] This helper function, along with the test functions themselves, look identical to the implementation of the test_translate_match_query macro. Is there any difference between the two? Is there a reason you didn't use the macro?

Comment thread mongosql/src/codegen/match_query.rs
@mattChiaravalloti
Copy link
Copy Markdown
Collaborator

One last idea is that we may want to update the semantics.md doc to reflect these changes. It should no longer be considered part of the spec that X IN Y is rewritten to X = ANY(Y), etc. This section in particular covers this. We may also want to add some spec query tests to demonstrate how IN and NOT IN work.

Copy link
Copy Markdown
Collaborator

@mattChiaravalloti mattChiaravalloti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking as approved even though I am leaving a ton of comments. I just don't want to block this from merging while I'm gone.

There is still work to be done even though I marked as approved:

  • all outstanding open comment threads should either result in code changes or have answers to them that explain why not (I resolved the ones that are addressed in one way or another)
    • A lot of the threads are still just some bookkeeping changes like fixing test names and/or inputs, addressing odd newlines, etc.
  • still need to add the spec/query tests as discussed offline
  • still need to update the desugarer before this can merge since it won't behave correctly until we do (we'll see the spec/query tests fail without this work done)

Feel free to tag someone else in in addition to Nick to give this another thorough pass if you want!

}),
);

test_algebrize!(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question still stands. My suggestion is that we remove this test; this PR should not need to make changes to this file.

env = map! {
("foo", 1u16).into() => Schema::Document( Document {
keys: map! {
"a".into() => Schema::Atomic(Atomic::String),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"a".into() => Schema::Atomic(Atomic::String),
"a".into() => Schema::Atomic(Atomic::Integer),

We still need to update this or else it will ultimately fail when schema checking is implemented correctly.

env = map! {
("foo", 1u16).into() => Schema::Document( Document {
keys: map! {
"a".into() => Schema::Atomic(Atomic::String),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"a".into() => Schema::Atomic(Atomic::String),
"a".into() => Schema::Atomic(Atomic::Integer),

Same issue applies here.

env = map! {
("foo", 1u16).into() => Schema::Document( Document {
keys: map! {
"a".into() => Schema::Atomic(Atomic::String),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"a".into() => Schema::Atomic(Atomic::String),
"a".into() => Schema::Atomic(Atomic::Boolean),

Same issue applies here as well.

Comment on lines +881 to +890
env = map! {
("foo", 1u16).into() => Schema::Document( Document {
keys: map! {
"a".into() => Schema::Atomic(Atomic::Integer),
},
required: set!{},
additional_properties: false,
..Default::default()
}),
},
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
env = map! {
("foo", 1u16).into() => Schema::Document( Document {
keys: map! {
"a".into() => Schema::Atomic(Atomic::Integer),
},
required: set!{},
additional_properties: false,
..Default::default()
}),
},

Since this test uses all Literals, we don't even need to include an env.

@@ -1 +1 @@
/// Optimizes IS and LIKE expressions such that they can be translated and
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] We should update this comment a bit to reflect that this now supports IN/NOT IN as well.

use super::*;

test_schema!(
in_operator_requires_two_args,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
in_operator_requires_two_args,
not_in_operator_requires_two_args,

[Question/Suggestion] Should we name these for in and not_in distinctly?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This question applies throughout this file but I'll only comment it once.

);

test_schema!(
basic_in_operator_schema_is_boolean,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question still stands, thought there's a typo in my suggestion. Should be not_in_operator_schema_is_boolean.

/// Contains an `unreachable!()` branch for the non-array RHS match arm. This branch is
/// structurally unreachable: the preceding guard already returns
/// [`Error::SchemaChecking`] for any RHS that does not satisfy `Schema::Array(Any)`.
fn get_in_operator_comparison_schema(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Praise] The implementation of the method is fantastic!

use crate::{air, mapping_registry::MqlMappingRegistryValue};

test_translate_expression!(
it_converts_to_mql_when_lhs_is_nullable,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question] (blocking) Shouldn't it be that if either side is nullable it translates to SQL? And when both are not nullable it translates to MQL?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really just the test name that is misleading, and the input that is invalid. If either operand is marked with is_nullable: true, the entire expression should also be marked with is_nullable: true. This test sets is_nullable: false for the overall In expression, but marks one of its operands, the LHS, with is_nullable: true.

Can you fix this test to be valid to demonstrate that when none of the args is nullable, the translation is MQL. And then a separate test that shows when either arg is nullable (and therefore the expression is nullable) it translates to SQL?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants