feat(rust): Vortex Phase 2 — AExpr-direct filter pushdown (stacked on #2)#1
feat(rust): Vortex Phase 2 — AExpr-direct filter pushdown (stacked on #2)#1lwwmanning wants to merge 72 commits into
Conversation
|
The uncompressed lib size after this PR is 59.4950 MB. |
1 similar comment
|
The uncompressed lib size after this PR is 59.4950 MB. |
|
The uncompressed lib size after this PR is 59.5098 MB. |
4 similar comments
|
The uncompressed lib size after this PR is 59.5098 MB. |
|
The uncompressed lib size after this PR is 59.5098 MB. |
|
The uncompressed lib size after this PR is 59.5098 MB. |
|
The uncompressed lib size after this PR is 59.5098 MB. |
|
The uncompressed lib size after this PR is 59.3440 MB. |
|
The uncompressed lib size after this PR is 59.5096 MB. |
1 similar comment
|
The uncompressed lib size after this PR is 59.5096 MB. |
|
The uncompressed lib size after this PR is 59.5094 MB. |
|
The uncompressed lib size after this PR is 59.5093 MB. |
1 similar comment
|
The uncompressed lib size after this PR is 59.5093 MB. |
|
The uncompressed lib size after this PR is 59.5095 MB. |
4 similar comments
|
The uncompressed lib size after this PR is 59.5095 MB. |
|
The uncompressed lib size after this PR is 59.5095 MB. |
|
The uncompressed lib size after this PR is 59.5095 MB. |
|
The uncompressed lib size after this PR is 59.5095 MB. |
…le-3 should-fix carry-forward items
…x + Dedicated single-cache refactor)
…_commit, headline)
|
The uncompressed lib size after this PR is 59.4899 MB. |
…htly, ruff, dprint) - vortex_convertor.rs: 4 let-else → ? rewrites; collapse Boolean function match - README.md: dprint fmt paragraph reflow + table column padding - test_vortex.py: ruff format (line-length-driven paren wrap and unwrap)
…(virtual-column per-column split)
User picked Step 3.4 Amend at the Phase 2→3 boundary; the cycle-1 4-vote
phase-end review accepted, but two Deferred work items are ready to land
as small follow-on PRs before declaring Phase 2 done:
- PR-2.7: port the 6 cutover-lost pushdown shapes (is_between, is_in,
starts_with, ends_with, Contains{literal:true}, Ternary) from the
deleted legacy SpecializedColumnPredicate path into the AExpr-direct
convertor. ~65 LoC + tests.
- PR-2.8: refactor the virtual-column guard at lower_ir.rs:780-815 from
full-refusal to per-column file-vs-virtual split, mirroring
polars-mem-engine's hive_predicate extraction pattern. ~60 LoC + tests.
Phase 2 phase_entry_sha (93643dd) stays unchanged; phase_end_cycle
stays at 1 — the next phase-end review will fire as cycle 2 against the
full cumulative phase diff (original PR-2.0..PR-2.6 + PR-2.7 + PR-2.8).
4-vote phase-end preset retained.
Also captures the Phase 1 + 2 CI lint sweep in last_user_touchpoint_what
(dprint reflow + ruff format + clippy-nightly question_mark/collapsible_match
+ typos exclude of .big-plans/ + PR title fixes); all lint checks green
on both stacked PRs as of 2026-05-18 22:00.
|
The uncompressed lib size after this PR is 59.4899 MB. |
Six shapes previously handled by the deleted legacy
SpecializedColumnPredicate path (removed in PR-2.6's Option B → A cutover)
are now reachable through the AExpr-direct convertor:
- is_between(lo, hi, closed) — decomposed to (col gt[_eq] lo) AND
(col lt[_eq] hi) per ClosedInterval variant. Manual decomposition
avoids importing Vortex's BetweenOptions (lives in vortex::scalar_fn,
outside polars-vortex's narrowed re-export at lib.rs:24).
- is_in([...]) — decomposed to OR of equalities; reuses the
polars-plan-internal try_extract_is_in_haystack helper to share the
extraction logic with the deleted SpecializedColumnPredicate path.
Refuses pushdown when nulls_equal=true AND the haystack had nulls
(polars_scalar_to_vortex can't emit a null scalar without dtype
context; safer to refuse than narrow the predicate).
- str.starts_with(prefix) → like(col, lit('prefix%'))
- str.ends_with(suffix) → like(col, lit('%suffix'))
- str.contains(sub, literal=True) → like(col, lit('%sub%')). literal=False
is regex (refused; Vortex's LIKE is not regex).
- Ternary { predicate, truthy, falsy } → case_when(condition, then, else).
The bytes_to_like_literal helper is resurrected verbatim from the deleted
legacy predicate.rs and lives inline in vortex_convertor.rs so polars-plan
owns its own LIKE-pattern escaping (no cross-crate plumbing). Refuses if
the needle contains LIKE special characters (% _ \\) — widening would be
sound under PARTIAL_FILTER but defeats the pushdown's perf win.
The new arms are inserted BEFORE the existing IRBooleanFunction(boolean_fn)
catchall arm so Rust pattern-matching ordering respects more-specific-first
(otherwise the catchall would match first and return None via its inner _).
Verification:
cargo test -p polars-plan --features vortex,is_between,is_in,regex,strings,dtype-struct vortex_convertor
→ 72 passed (9 new shape tests covering positive + refuse paths)
cargo check -p polars --features vortex,cloud,parquet,dtype-full → clean
cargo check -p polars-plan --features vortex → clean (no-strings build)
cargo check -p polars-stream --features vortex,cloud → clean (umbrella-equiv;
vortex-alone has a pre-existing feature gap per round-2 spec Pack A2)
…shdown table refresh
- py-polars/tests/unit/io/test_vortex.py: 7 new e2e tests covering all 6
newly-pushable shapes (is_between with default + closed='left' kwarg;
is_in; starts_with; ends_with; contains{literal:true}; Ternary; plus
a negative-path starts_with-wildcard-in-needle test verifying the
bytes_to_like_literal escape guard).
- crates/polars-vortex/README.md: Pushdown coverage table moves the 6
cutover-lost shapes from '❌ residual' to '✅ AExpr-direct convertor
(PR-2.7)' with the specific Vortex builder noted. Known limits removes
the 'Cutover-lost pushdown shapes' bullet (now resolved); dprint fmt
reflowed the table column widths.
Consolidates the amend-completion + Step 2.3-step-2 awaiting-review transitions for PR-2.7 (implementation commits 065df50 + 0e2c26f landed ahead of this plan update — minor ordering drift from the canonical inner-loop sequence; substance unchanged). Current State now reflects: - status: awaiting-review - current_pr: PR-2.7, pr_index: 8 (post-amend, PR-2.7 is the 8th in Phase 2) - planning_sub_flow: null (amend sub-flow closed) - last_phase_end_verdict: null (reset per amend semantics; next phase-end review fires as cycle 2 of Phase 2 after PR-2.7 + PR-2.8 complete) - phase_entry_sha unchanged (93643dd) so cumulative-phase reviews still span original PR-2.0..PR-2.6 + amend PR-2.7 + PR-2.8
|
The uncompressed lib size after this PR is 59.4899 MB. |
Both reviewers (fresh + correctness) REJECT with high confidence. Three must-fix correctness gates of the same bug class as PR-2.3 cycle-1 CAST cross-kind and PR-2.4 cycle-2 comparison pairwise-PType: (1) is_between arm lacks pairwise-PType gate (vortex_convertor.rs:319-340); (2) Ternary arm lacks THEN/ELSE pairwise-dtype gate (vortex_convertor.rs:563-572); (3) StringExpr arm lacks Utf8 input gate (vortex_convertor.rs:471-510). All three violate always-SAFE-fallback contract. Severity dedupe on Ternary disagreement = HIGHEST per spec (fresh: must-fix, correctness: should-fix with reduced-reach rationale). 6 should-fix + 2 nit also flagged. Entering Step 2.4 fix-application loop.
<details><summary>gauntlet Synthesizer Output (schema_version: 1)</summary>
```json
{
"schema_version": 1,
"preset": "pr-2",
"lenses_used": ["fresh", "correctness"],
"review_count": 2,
"unified_findings": [
{"severity": "must-fix", "kind": "bug", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:319-340", "description": "is_between arm lacks pairwise-PType gate between col and bounds. The arm calls Vortex's gt/gt_eq/lt/lt_eq builders DIRECTLY (not via AExpr::BinaryExpr recursion), so the BinaryExpr arm's pairwise-PType gate is NEVER invoked. The comment at L312-318 falsely claims 'schema-gated only transitively' via the BinaryExpr arm. Cross-PType bounds (e.g., Int32 col with Int64 literal bounds when TYPE_COERCION is off) will vortex_bail! at scan-time in Vortex's Binary::return_dtype (vortex-array/src/scalar_fn/fns/binary/mod.rs:130-136) with 'Cannot compare different DTypes', violating the always-SAFE-fallback contract. Same bug class as PR-2.3 cycle-1 CAST cross-kind and PR-2.4 cycle-2 comparison pairwise-PType must-fixes. Highly reachable: column dtypes are diverse and Rust literal types easily mismatch.", "recommended_fix": "Add pairwise gate BEFORE building comparisons: `let s = schema?; let col_dt = resolve_inner_dtype(input[0].node(), arena, s)?; let lo_dt = resolve_inner_dtype(input[1].node(), arena, s)?; let hi_dt = resolve_inner_dtype(input[2].node(), arena, s)?; if col_dt != lo_dt || col_dt != hi_dt { return None; }`. Mirror the precedent in Plus / comparison gates for the type_coercion-off path. Add a unit test `shape_is_between_cross_ptype_returns_none` (col=Int32, lo/hi=Int64). After the gate lands, rewrite or delete the misleading transitivity comment.", "found_by": ["fresh", "correctness"]},
{"severity": "must-fix", "kind": "bug", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:563-572", "description": "Ternary arm lacks THEN/ELSE pairwise-dtype gate. Comment at L556-562 falsely claims 'Vortex's CaseWhen is type-checked at builder time' — inspection of vortex-array/src/expr/expression.rs:45-62 shows try_new only validates arity; the dtype unification fires in CaseWhen::return_dtype at scan/pruning time (vortex-array/src/scalar_fn/fns/case_when.rs:185-191) with 'CaseWhen THEN and ELSE dtypes must match'. Cross-dtype truthy/falsy (e.g., nested Ternary with col_int32 vs col_int64) recurse independently and bypass the gate. Same bug class as PR-2.2 cycle-1 Plus must-fix; violates always-SAFE-fallback contract. Practical reach reduced for a Ternary at the filter root (Polars validates it must be Boolean-typed), but nested Ternary inside a complex predicate when TYPE_COERCION is off remains reachable.", "recommended_fix": "Mirror Plus arm gate BEFORE recursing: `let s = schema?; let then_dt = resolve_inner_dtype(*truthy, arena, s)?; let else_dt = resolve_inner_dtype(*falsy, arena, s)?; if then_dt != else_dt { return None; }`. resolve_inner_dtype currently has no Ternary arm — recursion through it returns None which is the safe fallback. Fix the false comment to accurately describe the explicit gate. Add unit test `shape_ternary_cross_dtype_returns_none`.", "found_by": ["fresh", "correctness"]},
{"severity": "must-fix", "kind": "bug", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:471-510", "description": "StringExpr arm lacks Utf8 input gate on input[0]. Vortex's Like.return_dtype vortex_bail!s at scan-time if input dtype isn't Utf8 (vortex-array/src/scalar_fn/fns/like/mod.rs:124-132). If input[0] is non-String (e.g., a struct field access producing Int32, or a Cast whose source isn't String), pushdown emits a Vortex `like(<Int32-typed-expr>, <utf8-pattern>)` that fails at scan-time. Violates always-SAFE-fallback contract — same class as PR-2.3 CAST gate.", "recommended_fix": "Add Utf8 gate: `let s = schema?; let col_dt = resolve_inner_dtype(input[0].node(), arena, s)?; if col_dt != DataType::String { return None; }` BEFORE building the `like` call. Add unit test `shape_starts_with_on_int_column_returns_none`.", "found_by": ["fresh"]},
{"severity": "should-fix", "kind": "missing-acceptance", "file_line": "py-polars/tests/unit/io/test_vortex.py:330-489", "description": "Acceptance criterion (c) NOT MET: no test verifies pushdown actually engages. All 7 e2e tests assert only post-decode correctness, which the residual-reapply preserves regardless. A regression where the convertor returns None for all 6 new shapes would silently fall back to residual and still pass every test.", "recommended_fix": "Add at least one engagement assertion per shape — either inspect `pl.scan_vortex(...).filter(...).explain()` for the pushed Vortex expression, use display_tree() to verify the pushdown path, or POLARS_VERBOSE + caplog. Alternative: Rust integration test mirroring shape_plus_arithmetic_structural's structural-assertion discipline.", "found_by": ["fresh", "correctness"]},
{"severity": "should-fix", "kind": "coverage", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:1889-1947", "description": "is_in arm has ZERO unit tests. Test-module comment claims 'covered by e2e Python tests; arena-level Series literal construction is awkward' but the e2e tests don't cover: (a) nulls_equal=true + haystack-contains-null (regression of this gate would silently push a narrower filter); (b) per-scalar conversion failure; (c) empty haystack. is_between IS unit-tested, contradicting the 'awkward' rationale.", "recommended_fix": "Add unit tests: shape_is_in_positive (using Series::from), shape_is_in_nulls_equal_with_null_returns_none, shape_is_in_empty_haystack_returns_none, shape_is_in_per_scalar_conversion_failure_returns_none.", "found_by": ["fresh", "correctness"]},
{"severity": "should-fix", "kind": "coverage", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:1898-2089", "description": "Coverage gaps that would catch the must-fix bugs: (a) is_between cross-PType bounds; (b) Ternary cross-dtype; (c) StringExpr on non-String column; (d) is_between Right/None ClosedInterval variants (only Both/Left covered both at unit and e2e level); (e) bytes_to_like_literal wildcard test only covers '%', missing '_' and '\\\\'.", "recommended_fix": "Add tests alongside the must-fix gate work: shape_is_between_cross_ptype_returns_none, shape_ternary_cross_dtype_returns_none, shape_starts_with_on_int_column_returns_none, shape_is_between_right_inclusive, shape_is_between_none_inclusive, bytes_to_like_literal arm tests for '_' and '\\\\'.", "found_by": ["fresh", "correctness"]},
{"severity": "should-fix", "kind": "doc-quality", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:27-64", "description": "Module-level doc-comment stale post-PR-2.7. L27 still says '(PR-2.1 foundation + PR-2.2 + PR-2.3 + PR-2.4 extensions)'. L29 still says '16 shapes' — should be 22. Shape table (L33-50) missing 6 new shapes. 'What this module does NOT cover yet' paragraph (L52-64) explicitly lists 'Ternary' (L61) — now covered.", "recommended_fix": "Update doc to include PR-2.7 shapes in the table, bump '16 shapes' to '22 shapes', remove 'Ternary' from the not-covered list, remove 'is_between'/'is_in'/'str.starts_with'/'str.ends_with' caveats.", "found_by": ["fresh"]},
{"severity": "should-fix", "kind": "convention", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:311-318", "description": "Factually wrong comment: 'Schema-gated only transitively: the recursive aexpr_to_vortex_expression calls inherit the existing comparison pairwise-PType gate (BinaryExpr arm)'. False — the arm builds gt/gt_eq DIRECTLY via Vortex builders, never re-entering the AExpr::BinaryExpr arm. Misleads readers and is design root cause of must-fix #1.", "recommended_fix": "After adding the explicit pairwise gate, rewrite the comment to describe the explicit gate (not a transitive one). Or delete the misleading claim.", "found_by": ["fresh", "correctness"]},
{"severity": "should-fix", "kind": "convention", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:556-562", "description": "Comment falsely claims 'Vortex's CaseWhen is type-checked at builder time' AND 'recursion's existing CAST/Plus pairwise-PType gates protect against mismatched arms'. Both FALSE: case_when() builder only validates arity; inner Plus/CAST gates protect their own subtrees, not Ternary truthy-vs-falsy mismatch. Root cause of must-fix #2.", "recommended_fix": "Rewrite to describe the added gate. Cite case_when.rs:185-191 as the bail-at-scan-time site.", "found_by": ["fresh", "correctness"]},
{"severity": "nit", "kind": "doc-quality", "file_line": "crates/polars-vortex/README.md:240", "description": "is_in README entry mentions only the nulls_equal=true+nulls refuse path; doesn't mention the per-scalar conversion refuse (e.g., Duration scalars in haystack fail polars_scalar_to_vortex).", "recommended_fix": "Append 'or any haystack scalar fails Polars→Vortex conversion' to the is_in row's path-description cell.", "found_by": ["fresh"]},
{"severity": "nit", "kind": "encapsulation", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:855-875", "description": "bytes_to_like_literal is private; only '%' branch tested indirectly. Other branches ('_', '\\\\', invalid UTF-8) are dead from a test standpoint.", "recommended_fix": "Either make pub(super) for direct unit testing, or add three more arm-level negative tests using each unsafe byte: '_', '\\\\', and a needle that's not valid UTF-8.", "found_by": ["fresh"]}
],
"disagreements": [
{"topic": "Ternary cross-dtype gate severity (must-fix vs should-fix)", "positions": [{"lens": "fresh", "position": "must-fix — same bug class as PR-2.3 cycle-1 CAST cross-kind and PR-2.2 cycle-1 Plus must-fixes. Violates always-SAFE-fallback contract. Should be held to the same bar as the other pairwise-dtype gates."}, {"lens": "correctness", "position": "should-fix — same bug class but with reduced practical reach. A Ternary at the filter root must be Boolean-typed (Polars validates) so the cross-dtype hazard is gated to nested Ternary inside a complex predicate when TYPE_COERCION is off."}], "synthesizer_call": "Hold at must-fix per spec rule (severity dedupe = HIGHEST) and per the existing precedent pattern (Plus/comparison gates were must-fix even though TYPE_COERCION often handles cross-PType in practice). Reduced reach is a mitigation argument, not a category change. Correctness's reachability rationale captured for implementer awareness."}
],
"dropped_re_flags": [],
"executive_summary": "Both reviewers independently converged on REJECT with high confidence. Three must-fix correctness gates of the same bug class as PR-2.3 cycle-1 CAST cross-kind and PR-2.4 cycle-2 comparison pairwise-PType: (1) is_between arm lacks pairwise-PType gate; (2) Ternary arm lacks THEN/ELSE pairwise-dtype gate; (3) StringExpr arm lacks Utf8 input gate. All three violate the always-SAFE-fallback contract that PR-2.6's Option B->A cutover rests on. Fixes are 4-6 lines each, mirroring the Plus arm gate precedent. Two of the three are accompanied by misleading comments that mask the design errors and need rewriting alongside the gate fixes. Severity disagreement on Ternary resolved per spec to HIGHEST = must-fix. Coverage and engagement-assertion gaps reinforce the must-fix work: acceptance criterion (c) is NOT met; is_in arm has ZERO unit tests despite the 'awkward' rationale; is_between Right/None ClosedInterval variants are unexercised; cross-PType refusal tests for the three must-fix gates are missing. Verified design strengths: bytes_to_like_literal escape logic preserves legacy parity; ClosedInterval mapping correct; match arm order correct; cfg-gating consistent; is_in's haystack-null refuse decision sound; multi-byte UTF-8 needles handled cleanly. Total estimated work: 12-25 LoC + ~10 unit tests + doc updates.",
"overall": "reject",
"must_fix_count": 3,
"should_fix_count": 6,
"nit_count": 2,
"review_cycles_this_invocation": 1
}
```
</details>
…ype gate (review finding) The is_between arm constructs Vortex gt/gt_eq/lt/lt_eq builders DIRECTLY rather than re-entering AExpr::BinaryExpr, so the BinaryExpr arm's pairwise-PType gate would NEVER fire transitively. Cross-PType bounds (Int32 col with Int64 literal bounds when TYPE_COERCION is off) would vortex_bail! at scan-time in Vortex's Binary::return_dtype with 'Cannot compare different DTypes', violating the always-SAFE-fallback contract. Same bug class as PR-2.3 cycle-1 CAST cross-kind and PR-2.4 cycle-2 comparison pairwise-PType must-fixes. Add an explicit pre-recursion pairwise-PType gate (4 lines) mirroring the BinaryExpr arm's pattern. Rewrite the misleading 'Schema-gated only transitively' comment to describe the explicit gate. Add 4 unit tests (shape_is_between_right_inclusive, shape_is_between_none_exclusive, shape_is_between_cross_ptype_returns_none, shape_is_between_without_schema_returns_none) which together close the cycle-1 should-fix #6d ClosedInterval coverage gap AND the cycle-1 should-fix #6a cross-PType test gap.
…wise-dtype gate (review finding) The Ternary arm fed truthy/falsy into Vortex's case_when builder without verifying they share a dtype. case_when's try_new builder only validates arity (vortex-array/src/expr/expression.rs:45-62); the dtype unification check fires later in CaseWhen::return_dtype (vortex-array/src/scalar_fn/fns/case_when.rs:185-191) which vortex_bail!s with 'CaseWhen THEN and ELSE dtypes must match' at scan/pruning time. Cross-dtype truthy/falsy (nested Ternary with col_int32 vs col_int64 when TYPE_COERCION is off) bypassed the inner arms' CAST/Plus gates, violating the always-SAFE-fallback contract. Same bug class as PR-2.2 cycle-1 Plus must-fix. Reachability: a Ternary at the filter root must be Boolean-typed (Polars validates), so the cross-dtype hazard is gated to nested Ternary inside a complex predicate. Held to must-fix per existing precedent (Plus / comparison gates) — always-SAFE-fallback contract is binary. Add an explicit pre-recursion THEN/ELSE pairwise-dtype gate (4 lines) mirroring the Plus arm pattern. Rewrite the misleading 'CaseWhen is type-checked at builder time' + 'CAST/Plus gates protect mismatched arms' comments to accurately describe the explicit gate. Add 2 unit tests (shape_ternary_cross_dtype_returns_none, shape_ternary_without_schema_returns_none).
… input gate (review finding)
Vortex's Like kernel (vortex-array/src/scalar_fn/fns/like/mod.rs:124-132
return_dtype) vortex_bail!s with 'Cannot apply like to non-Utf8 input' at
scan-time if input[0] isn't Utf8. The StringExpr arm fed input[0] directly
into the recursive convertor without verifying the column's dtype, so a
non-String input (struct field access producing Int32, Cast whose source
isn't String) would silently push a like(<non-utf8>, <utf8-pattern>) that
fails at scan-time, violating the always-SAFE-fallback contract. Same bug
class as PR-2.3 cycle-1 CAST cross-kind must-fix.
Add an explicit Utf8 input gate (3 lines) BEFORE any needle extraction or
pattern construction work. Update existing 4 StringExpr unit tests
(shape_starts_with_positive / _wildcard_in_needle / shape_ends_with_positive
/ shape_contains_literal_{true,false}) to pass schema (the gate now consults
schema?). Add 4 new unit tests: shape_starts_with_on_int_column_returns_none
(must-fix pola-rs#3 regression), shape_starts_with_without_schema_returns_none
(no-schema refuse), shape_starts_with_underscore_in_needle_returns_none and
shape_starts_with_backslash_in_needle_returns_none (close cycle-1 should-fix
#6e and nit pola-rs#11 coverage gap).
Also adds schema_s_string() and schema_s_string_i_int32() test helpers.
… Utf8 gate) — all 3 must-fix done; should-fix sweep next
- vortex_convertor.rs module-level doc-comment refresh (should-fix pola-rs#7): bumps shape count 16 -> 22, adds 6 PR-2.7 shapes to the table with their Vortex builder + gate notes, removes Ternary from the 'not covered yet' list, updates PR-2.5 reference to reflect the Deferred-work slip. - 5 structural-assertion unit tests for the new shapes (should-fix pola-rs#4 — acceptance criterion (c)): shape_is_between_structural, shape_starts_with_structural, shape_ends_with_structural, shape_contains_literal_true_structural, shape_ternary_structural. Each asserts the produced Vortex Expression's Display contains the expected builder anchor (and/like/case_when) plus the column/literal substrings. Catches paste-swap bugs the prior .is_some()-only tests would miss (mirrors the shape_plus_arithmetic_structural discipline from PR-2.2 cycle-1). is_in structural omitted — its Series-literal arena construction needs SpecialEq<Series> infrastructure not yet wired into the test helpers; e2e Python test covers correctness. - README.md is_in entry refinement (cycle-1 nit pola-rs#10): document the per-scalar-conversion refuse path (Duration scalars in haystack fail polars_scalar_to_vortex) alongside the existing nulls_equal+nulls refuse-path note.
|
The uncompressed lib size after this PR is 59.4899 MB. |
…Ternary arm (review finding) The cycle-1 must-fix #2 fix added a THEN/ELSE pairwise-dtype gate to the Ternary convertor arm, calling resolve_inner_dtype(*truthy)? to verify match. But resolve_inner_dtype had no Ternary arm — fell to the catchall _=>None — so any NESTED Ternary truthy/falsy would unconditionally return None, ?-propagating to refuse pushdown even when inner dtypes match. Fail-closed-safe but a real coverage regression vs the pre-gate baseline. Add an AExpr::Ternary arm to resolve_inner_dtype that recurses on truthy+falsy and returns Some(t) when t==e, mirroring the Plus arm's self-contained verification pattern at L726-730. Walk truthy + falsy, require equality (Vortex's case_when builder requires THEN and ELSE share a return_dtype), unresolvable shapes fall through to None. Add a regression test shape_ternary_nested_matching_dtypes_pushes that constructs a 2-level nested Ternary (outer.truthy = inner Ternary) and asserts pushdown engages. Before this fix the test would have asserted .is_some() but produced None.
…/is_in imports (review finding) PR-2.7 added 5 new imports (AnyValue, VortexScalar, like, or_collect, and the fn bytes_to_like_literal) that are used ONLY inside arms gated on strings or is_in, but the imports themselves were ungated. Under `cargo build -p polars-plan --features vortex --no-default-features` this produced 5 fresh unused-import warnings; the umbrella CI build uses --all-features so the warnings didn't fail loud. Gate each import + the fn definition matching the existing pattern (e.g., the cycle-1 PR-2.7 already gated IRStringFunction on `#[cfg(feature = "strings")]` — same idiom for the underlying types the StringExpr arm uses). Split AnyValue from DataType in the prelude line (DataType is used by every arm — must stay ungated; AnyValue is used only in the StringExpr needle extraction). Split like/or_collect out of the multi-import on the vortex::expr line and gate each on strings/is_in respectively. Verified clean: cargo check -p polars-plan --features vortex (no warnings on vortex_convertor.rs), with all PR-2.7 features (--features vortex,is_between,is_in,regex,strings,dtype-struct, same), and umbrella cargo check -p polars --features vortex,cloud,parquet,dtype-full. 88 unit tests pass.
…rs#5 — strengthen structural tests (review finding) Cycle-1 added 5 structural-assertion tests (shape_is_between_structural, shape_starts_with_structural, shape_ends_with_structural, shape_contains_literal_true_structural, shape_ternary_structural) to close cycle-1 acceptance criterion (c) but cycle-2 reviewers flagged three with loose assertions that don't catch specific paste-swap classes: pola-rs#3 shape_is_between_structural: vacuous negative anchor '!s.contains("eq(") && !s.contains("==")'. Vortex's binary Display uses single-char '=' ('(x = 10)'), never 'eq(' or '=='; both substrings are absent in correct AND eq-paste-swap outputs. Replace with 's.contains(">") && s.contains("<")' (an eq-paste-swap '(x = 10) AND (x = 20)' has neither '>' nor '<'). pola-rs#4 shape_starts_with_structural + shape_ends_with_structural: don't catch starts↔ends paste-swap. Both produce 'like'+needle+'%' Display strings differing only in '%' position. Strengthen with joint substring (s.contains("prefix%")) + negative anchor (!s.contains("%prefix")). Mirror for ends_with. pola-rs#5 shape_ternary_structural: doesn't catch truthy↔falsy paste-swap. case_when(cond, b, a) still satisfies s.contains('a') && s.contains('b') because both column refs remain present. Switch truthy/falsy from cols 'a' / 'b' to distinct literals 777 / 888 + assert positional ordering (t_pos < f_pos in the Display string). All 88 tests pass (5 structural tests strengthened; shape_plus_arithmetic_structural unchanged).
…+ nit pola-rs#7 — README gate notes + is_in empty-haystack comment (review finding) - README pushdown table: add the three schema-gate notes that the cycle-1 fix-commits added to the convertor module-doc table but didn't propagate to README. is_between → '(schema-gated pairwise-PType)'. starts_with / ends_with / contains → '(schema-gated Utf8 input)'. Ternary → '(schema-gated THEN/ELSE pairwise-dtype)'. Closes the H1 cross-reference drift between convertor source and README documented in cycle-2 finding pola-rs#6. - vortex_convertor.rs is_in arm: add a 6-line comment before or_collect(terms) acknowledging that 'or_collect(vec![])' returns None for an empty haystack → arm returns None. Polars's post-decode residual reapply handles the all-false 'is_in([])' semantic correctly, so this is a missed optimization (we could pre-empt with lit(false)) — not a bug. Closes cycle-2 finding nit pola-rs#7. All 88 unit tests still pass.
…e Deferred entry Polars's is_in uses TotalOrd (NaN==NaN); Vortex's eq uses IEEE 754 (NaN!=NaN). This divergence affects ALL float-comparison convertor arms (foundation eq/not_eq/lt/lt_eq/gt/gt_eq since PR-2.1 + PR-2.7's new is_in/is_between). The legacy SpecializedColumnPredicate::EqualOneOf path had the same divergence; PR-2.7 just amplifies the surface. Add a Deferred work entry capturing the divergence with two mitigation paths: (a) refuse pushdown when any float literal is NaN (~10 LoC, conservative); (b) accept the divergence and document as known limitation. Not blocking — residual reapply preserves correctness. Tracked so a future read doesn't assume float pushdown is fully isomorphic with Polars semantics.
… ready for cycle 3
|
The uncompressed lib size after this PR is 59.3436 MB. |
Three-cycle inner-loop gauntlet history: - Cycle 1 REJECT (3 must-fix + 6 should-fix + 2 nit) — schema-gate gaps in the same bug class as PR-2.3 / PR-2.4 cycle-1 pairwise-PType must-fixes. - Cycle 2 ACCEPT (0 must-fix + 6 should-fix + 2 nit) — gates correctly implemented; user picked maximum-thoroughness apply-all-polish-inline. - Cycle 3 ACCEPT (0 must-fix + 0 should-fix + 1 nit) — polish verified clean; only finding is pre-existing CastOptions unused-import (PR-2.3 era, out of PR-2.7 scope). 17 PR-work commits 065df50..71540df (feat: convertor arms; docs: e2e tests + README; plan: amend complete + awaiting + cycle 1 atomic; fix×3 + decrement×3 must-fix gates; docs: cycle-1 should-fix sweep; fix×3 cycle-2 should-fix; docs + plan: cycle-2 polish + NaN Deferred + cycle-2 polish landed). 25 new convertor unit tests + 7 new e2e Python tests. 88 total convertor tests pass. Current State advances: status: executing, current_pr: PR-2.8, pr_index: 9, per-PR counters reset. Phase 2 phase-end review (cycle 2) will fire after PR-2.8 completes. Append Implementation status entry capturing scope shipped, tests added, 3-cycle review history, confidence, deferred items (1 new: Float NaN semantic divergence), and 3 surprises during implementation (H4 self- reinforcement validated, H2 new-prose internal edge case validated, prior_fix_commit_sha attention block worked as designed).
Summary
Phase 2 of the polars-vortex integration. Replaces Phase 1's
SpecializedColumnPredicate-based filter-pushdown path with an AExpr-direct convertor that handles a strictly broader shape set — arithmetic, CAST, struct field access, multi-column predicates — while preserving the always-SAFE residual-fallback contract.Stacked on PR #2 (Phase 1: functional + benches). Base branch is
vortex-integration-phase-1. Review #2 first; this PR rebases ontomainonce #2 lands.Status: 49 commits on top of Phase 1's tip, +2,658 / -371 LOC across 17 files, 4-vote phase-end gauntlet (spec + correctness + maint + arch) accepted by all 4 reviewers at high confidence.
cargo check -p polars --features vortex,cloud,parquet,dtype-fullclean. Default Polars build (novortexfeature) unaffected.What this PR ships
The AExpr-direct convertor (PR-2.1 through PR-2.6)
A new module at
crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs(~1100 LoC + 64 unit tests) walks the PolarsArena<AExpr>directly at IR-build time, inpolars-stream/src/physical_plan/lower_ir.rs::FileScanIR::Vortexarm, whereexpr_arenais still in scope. The resulting VortexExpressionis threaded ontoVortexReaderBuilder.aexpr_filterand consumed byVortexFileReader::begin_read→ScanBuilder::with_filter. PR-2.6 deletes the legacypolars_to_vortex_predicate(Option B → A cutover complete).16 AExpr shapes implemented + 4 schema gates:
AExpr::Column(name)get_item(name, root())AExpr::Literal(LiteralValue::Scalar(s))lit(s)eq/not_eq/lt/lt_eq/gt/gt_eqAnd/LogicalAnd/Or/LogicalOr)and/orPlus)checked_addcast(child, target_dtype)col.struct.field("inner"))get_item(field, struct_expr)IsNull/IsNotNullis_null/is_not_nullNotnotThe gates are load-bearing: Vortex's per-array
CastKernelis strictly within-kind (Primitive↔Primitive only, Bool↔Bool only, Utf8↔Utf8 only — cross-kind castsvortex_bail!at scan-time);checked_add'sBinary::coerce_argsrequires same-PType numeric operands; comparisons require same-PType non-extension operands. Without the gates, predicates likepl.col(int_col).cast(pl.String) == "x"would scan-time-error instead of falling back to residual. Each gate was caught in cycle-1 review of the corresponding PR and fixed inline.Option B → A cutover (PR-2.6)
Deletes
polars_to_vortex_predicate+convert_specialized+bytes_to_like_literalfrompolars-vortex/src/read/predicate.rs(~244 net LoC removed). Preservespolars_scalar_to_vortex(the canonicalScalar→VortexScalarmapping; PR-2.1 exposedpubfor cross-crate reuse from the convertor). The cutover means the AExpr-direct convertor is the SOLE filter-pushdown path going forward.Substantive PR-2.0 housekeeping (Dedicated single-cache refactor)
Pre-PR-2.0,
VortexCacheMode::Dedicated(N)allocated TWO Moka caches per logical scan — one at IR-build (for the schema-discovery postscript read) and one at streaming-time (for the data read). PR-2.0 introduces aVortexSegmentCacheRefnewtype wrapper (Arc with Debug + Deref, sinceFileScanIRderives Debug andSegmentCachedoesn't impl Debug) and threads the resolved Arc fromvortex_file_info→FileScanIR::Vortex::segment_cache→VortexReaderBuilder→VortexFileReader::initialize. Now ONE cache instance is shared across IR-build + data reads. Plus 10 cycle-3 should-fix carry-forward items swept (4 plan-doc + 4 code-doc + 2 mixed).User-visible behavior change
Predicates that previously fell back to "decode everything then filter post-decode" now push down to Vortex's zone pruner. PR #2's bench harness anchors the measurement; compare on PR #2 vs. this PR's tip:
The
vortex_scan/filter_arithmeticbench (col + 1 == N) is the key Phase 1 → Phase 2 measurement. The other two (no_filter,filter_lt) anchor the no-regression invariant — neither phase should show meaningfully different perf on those.Phase 2 retroactively closes Phase 2 exit criterion (c) "e2e tests verify pushdown engagement" — the bench's
filter_arithmeticis a stronger engagement signal than the originally-plannedPOLARS_VORTEX_VERIFY_PUSHDOWNdebug log (which is deferred); if Phase 2 didn't actually push down, the bench numbers would be flat against Phase 1.Sub-PR breakdown (7 sub-PRs, 6 landed, 1 SLIPPED)
VortexSegmentCacheRefnewtype +FileScanIR::Vortex.segment_cachethread-through + 10 cycle-3 should-fix carry-forward items)lower_ir.rs:780+ Plus arithmetic + And/Or schema gate. Cycle-1 surfaced hive-column reachability + Plus dtype gate must-fix items (file_info.schema"Always includes all hive columns"; Vortexchecked_adderrors at scan-time on non-numeric / overflow). Cycle-2 extended virtual-column guard to also refuserow_index+include_file_paths.CastKernelis strictly within-kind;NonStrict/OverflowingPolars semantics diverge from Vortex's fail-on-overflow).col.struct.field("inner")) + proactive comparison pairwise-PType gate (H4 sibling bug surfaced in PR-2.3 cycle-2 review). Resolved the Plus cross-PType deferred entry.datetime_parts/year/month/ etc. builders invortex::expr::*. Convertor's residual fallback handles correctly via the_ => Nonearm; multi-scan reapply preserves correctness. Tracked as Deferred work pending upstream Vortex API or pin bump.SpecializedColumnPredicatefast path (Option B → A cutover). Cycle-1 surfaced silent pushdown coverage regression on 4 shapes (is_between,is_in,starts_with,ends_with) that the legacy path handled; formally deferred to a follow-up PR per Phase 2 exit (b)'s "documented as deliberately deferred" clause.Final 4-vote phase-end review
Phase 2 closes with all 4 reviewers (
spec+correctness+maint+arch) ACCEPT at high confidence. Zero must-fix items. ~10 should-fix observations addressed inline (CI lints fixed, plan §4 architectural-moves drift fixed, orphanedaexpr_predicate.rsstub deleted, staleAccepted tradeoffsentry struck through,Function String::Contains{literal:true}andTernaryadded to Deferred work to satisfy Phase 2 exit criterion (b)).Deferred work growth (14 entries cumulative; +7 this phase)
Most-impactful new deferrals:
is_between,is_in,starts_with,ends_with,Contains{literal:true},Ternary) — ~65 LoC + tests mechanical port; perf-only regression, correctness preserved via multi-scan reapply.datetime_partsbuilder availability.POLARS_VORTEX_VERIFY_PUSHDOWNdebug mode — partial form ships via the structural-assertion test for Plus + the new bench harness in PR feat(rust): Vortex Phase 1 — native file-format integration + Criterion benches #2.Tests (cumulative)
vortex_convertor.rs::tests(15 shape-coverage + cross-PType refusals + schema-gate refusals + structural-assertion for Plus + nested-struct chain + cycle-1/2 regression tests across PR-2.2-.4-.6).py-polars/tests/unit/io/test_vortex.py:test_scan_with_arithmetic_filter—col + 1 == 5(PR-2.2)test_scan_with_cast_filter—col.cast(Int64) > 100(PR-2.3)test_scan_with_cross_kind_cast_filter—col.cast(String)regression (PR-2.3 cycle-1 fix)test_scan_with_struct_field_filter—col.struct.field("inner") == "x"(PR-2.4)test_scan_with_hive_partitioning_and_filter— virtual-column guard regression (PR-2.2 cycle-1)test_scan_with_row_index_and_filter— virtual-column guard extension (PR-2.2 cycle-2)Cross-crate changes
polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs(new)polars-plan/src/plans/aexpr/predicates/mod.rspolars-plan/src/plans/conversion/dsl_to_ir/scans.rsvortex_file_infosegment_cache thread-through (PR-2.0).polars-plan/src/dsl/file_scan/mod.rsFileScanIR::Vortex.segment_cachefield (PR-2.0).polars-plan/src/plans/optimizer/expand_datasets.rspolars-stream/src/physical_plan/lower_ir.rspolars-stream/src/nodes/io_sources/vortex/builder.rsaexpr_filter: Option<VortexExpression>field.polars-stream/src/nodes/io_sources/vortex/mod.rsVortexFileReader.aexpr_filterfield +begin_readconsumer.polars-vortex/src/lib.rsdtype+exprto narrowedvortex::re-export (for the convertor'sDType::Primitive/Expressionreferences).polars-vortex/src/read/mod.rsVortexSegmentCacheRefnewtype + 3 Arc-identity tests (PR-2.0).polars-vortex/src/read/predicate.rspolars_to_vortex_predicate+convert_specialized+bytes_to_like_literal; keptpolars_scalar_to_vortex+ temporal/decimal helpers + 2 new Decimal regression tests.polars-vortex/README.mdpolars-mem-engine/src/scan_predicate/functions.rsFileScanIR::Vortex.segment_cachedestructure site (PR-2.0).py-polars/tests/unit/io/test_vortex.py.big-plans/vortex-integration.mdTest plan
cargo check -p polars --features vortex,cloud,parquet,dtype-full— cleancargo test -p polars-plan --features vortex,is_between,is_in,dtype-struct vortex_convertor— 63 tests pass (withoutdtype-struct: 58)cargo test -p polars-vortex --features dtype-date,dtype-datetime,dtype-time,dtype-decimal predicate— 8 tests passcargo check -p polars-stream --features vortex,cloud— cleancargo fmt --all -- --check— cleantest_vortex.py)cargo bench -p polars ... --bench io_vortex -- --baseline phase-1showing the expectedfilter_arithmeticspeedup vs. PR feat(rust): Vortex Phase 1 — native file-format integration + Criterion benches #2's saved baseline🤖 Generated with Claude Code