Factorized intermediate results for native multi-hop traversal

## The problem

Our native multi-hop traversal materializes intermediate results in flat (fully enumerated) form. `expand_batch` flattens the CSR adjacency immediately — for each input row it emits one output row per neighbor and `take()`s every carried column once per neighbor:

```rust
// crates/lance-graph/src/lance_native_planner/csr_expand.rs
for &n in csr.neighbors(src.value(row)) {
    parent_idx.push(row_u32);   // repeat the parent row
    neighbors.push(n);          // once per neighbor
}
// ...then take() duplicates every carried column once per neighbor
```

For a chain like `(a)-[:KNOWS]->(b)-[:LIKES]->(c)`, the `a` columns get duplicated once per `(b, c)` path, so intermediate size grows as O(N^k) for k hops. This is the classic many-to-many blowup.

## Background

KuzuDB (and the factorized-database line of work it builds on) avoids this with **factorized intermediate results**: keep intermediates in a compressed, Cartesian-product form so projections and aggregations push through without materializing the explosion. See [What is KuzuDB](https://www.puppygraph.com/blog/what-is-kuzudb).

## Why this fits lance-graph well

CSR already produces the factorized group for free: `csr.neighbors(src)` returns a contiguous slice, which is exactly the shape of an Arrow `ListArray` (offset array + values), i.e. a nested child group. Today we take that naturally-nested output and flatten it. A factorized expand simply keeps the nesting — emit `(parent values once) + (list of neighbors)` instead of `(parent values × neighbors)`.

It also belongs entirely in `LanceNativePlanner`. DataFusion's `RecordBatch` model is flat-relational, so factorization lives in the native operator tree, not the DataFusion fallback. This reinforces the native-planner direction in #159.

## Scope and caveats

- **Survives only inside a native-operator chain.** Handing a batch back to a DataFusion operator (cross-table join, sort, filter on a cross-hop expression) requires unfolding to flat. Value is proportional to how long a native chain we can keep.
- **Payoff needs consumers that push through nested form** — aggregations (`count`, `EXISTS`, `collect`), `LIMIT`, degree-style queries, heavily-filtered deep traversals. For `RETURN a, b, c` with no aggregation the final output is the flat product anyway; factorization still saves intermediate memory/CPU but the last step enumerates.
- **Multi-hop only.** Single-hop expand gets nothing (its output is just the neighbor set). This is gated on Phase 3 (multi-hop / `VariableLengthExpand`) in #159.

## Proposed approach

- [ ] **Step 1 — nested expand output (correctness-preserving)**
  - [ ] Have native expand emit a nested neighbor column (Arrow `ListArray`) instead of flattening
  - [ ] Add a `Flatten`/`Unfold` operator at the native → DataFusion boundary
  - [ ] Wire it so any plan can always unfold to flat (no behavior change, just a new option)
- [ ] **Step 2 — factorized consumers**
  - [ ] Factorized `count` / `EXISTS` that consume the nested form without unfolding
  - [ ] Push projections through the factorized representation
  - [ ] `LIMIT` short-circuit over nested groups
- [ ] **Step 3 — multi-hop chains**
  - [ ] Keep factorization across a chain of native expands (Phase 3 `VariableLengthExpand`)
  - [ ] Unfold only at the final boundary or when an operator can't consume nested input
- [ ] **Step 4 — benchmarks**
  - [ ] Many-to-many multi-hop workload (flat vs. factorized) on intermediate size, memory, latency

## Related

- #159 (CSR native traversal) — this is a Phase 3 sub-track of that effort
- Possible follow-up: worst-case optimal joins (WCOJ) for cyclic patterns (triangles), complementary to factorization


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Factorized intermediate results for native multi-hop traversal #163

The problem

Background

Why this fits lance-graph well

Scope and caveats

Proposed approach

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Factorized intermediate results for native multi-hop traversal #163

Description

The problem

Background

Why this fits lance-graph well

Scope and caveats

Proposed approach

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions