feat: native single-hop Expand via CSR adjacency index (Phase 2, #159) #162

Open
jja725 wants to merge 10 commits into lance-format:main from jja725:feat/csr-native-expand

@jja725 jja725 commented Jun 23, 2026

Collaborator

Summary

Phase 2 of #159: execute single-hop Cypher Expand natively using the Phase 1 CsrIndex instead of relationship-scan + two SQL joins. Implemented as custom DataFusion operators, with automatic fallback to the existing join path for anything not yet supported natively.

This follows the DuckPGQ approach of pairing a relational engine with CSR acceleration. The pieces are:

  • CsrExpandExec — topology only: for each source row, look up neighbors in the CSR and emit one row per neighbor with the neighbor's row id appended.
  • LanceTakeExec — materializes target node properties from those row ids via a RowMaterializer (in-memory arrow::compute::take now; a Lance-dataset take lands in Phase 4).
  • CsrExpandNode / LanceTakeNode — logical extension nodes; CsrExtensionPlanner + CsrQueryPlanner build the CSR and materializer at physical-planning time.
  • LanceNativePlanner overrides only Expand lowering and delegates everything else to DataFusionPlanner, so ExecutionStrategy::LanceNative is always correct — it uses CSR when it can and joins otherwise.
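
The split between topology and materialization described above can be illustrated with a minimal sketch. It uses plain vectors rather than Arrow batches or DataFusion operators, and `expand_rows` and `take_rows` are hypothetical names chosen for illustration; they are not the PR's `expand_batch` / `take_batch` signatures.

```rust
/// Topology step (illustrative only): for each input row, emit one
/// `(input_row_index, neighbor_row_id)` pair per neighbor.
/// `adjacency[row_id]` is assumed to hold the neighbor row ids of `row_id`.
pub fn expand_rows(source_row_ids: &[u64], adjacency: &[Vec<u64>]) -> Vec<(usize, u64)> {
    let mut out = Vec::new();
    for (input_idx, &src) in source_row_ids.iter().enumerate() {
        // A source row id with no adjacency entry simply produces no output rows.
        if let Some(dsts) = adjacency.get(src as usize) {
            for &dst in dsts {
                out.push((input_idx, dst));
            }
        }
    }
    out
}

/// Materialization step (illustrative only): gather values from a target
/// column by row id. Out-of-range ids yield `None` in this sketch.
pub fn take_rows<T: Clone>(column: &[T], row_ids: &[u64]) -> Vec<Option<T>> {
    row_ids
        .iter()
        .map(|&id| column.get(id as usize).cloned())
        .collect()
}
```

With adjacency `[[1, 2], [2], []]` and source row ids `[0, 1, 2]`, `expand_rows` yields three pairs, and passing the neighbor ids to `take_rows` over a name column returns the target names in the same order. Keeping the two steps separate means the expand step never touches property data.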

Design decisions

  • Dense-ROWID model: the CSR vertex id is the node's row id (csr.neighbors(src_rowid) -> dst_rowids), mirroring how every Lance index works (key → row ids → take() to materialize). Generalizes to Lance stable row ids in Phase 4.
  • Materialize all target columns via take() (schema parity with the join path) rather than analyzing which are referenced.
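
As a rough illustration of the dense-ROWID model, the sketch below builds a compressed-sparse-row adjacency in which the vertex id is the row id itself. The `Csr` type and `from_edges` constructor are assumptions made for this example and are not the Phase 1 `CsrIndex` API; only the `neighbors(src_rowid) -> dst_rowids` shape comes from the description above.

```rust
/// Hypothetical stand-in for a CSR adjacency index keyed by dense row ids.
pub struct Csr {
    /// `offsets[v]..offsets[v + 1]` is the slice of `targets` owned by vertex `v`.
    offsets: Vec<usize>,
    /// Destination row ids, grouped by source row id.
    targets: Vec<u64>,
}

impl Csr {
    /// Builds the index from `(src_rowid, dst_rowid)` pairs with a counting sort.
    /// Assumes every `src_rowid` is `< num_vertices`; panics otherwise.
    pub fn from_edges(num_vertices: usize, edges: &[(u64, u64)]) -> Self {
        let mut offsets = vec![0usize; num_vertices + 1];
        // Count the out-degree of each source vertex.
        for &(src, _) in edges {
            offsets[src as usize + 1] += 1;
        }
        // Prefix sum turns the counts into start offsets.
        for i in 0..num_vertices {
            offsets[i + 1] += offsets[i];
        }
        // Scatter destinations into place, preserving input order per source.
        let mut cursor = offsets.clone();
        let mut targets = vec![0u64; edges.len()];
        for &(src, dst) in edges {
            let slot = &mut cursor[src as usize];
            targets[*slot] = dst;
            *slot += 1;
        }
        Csr { offsets, targets }
    }

    /// Returns the destination row ids for `src_rowid` (empty if out of range).
    pub fn neighbors(&self, src_rowid: u64) -> &[u64] {
        let s = src_rowid as usize;
        if s + 1 >= self.offsets.len() {
            return &[];
        }
        &self.targets[self.offsets[s]..self.offsets[s + 1]]
    }
}
```

Because the key is the row id, a lookup is two offset reads and a slice, and the returned ids can be handed directly to a `take()`-style materializer.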

Full design and task breakdown: docs/superpowers/specs/2026-06-22-csr-native-expand-operator-design.md and docs/superpowers/plans/2026-06-22-csr-native-expand-operator.md.

Native vs. fallback

Served natively: a query with exactly one single-hop Expand over a single relationship type, in the Outgoing or Incoming direction, with no inline relationship or target property filters and no bound relationship variable. Wrapping Project/Filter/Sort/Limit/Offset/Distinct operators run as normal DataFusion operators on the native stream.

Falls back to the DataFusion join path: variable-length / multi-hop, multiple relationship types, undirected, inline {k:v} filters, bound relationship variable, Join, Unwind.
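
The native/fallback rules above can be read as a single predicate over the query shape. The following is a sketch under assumed types: `QueryShape`, `ExpandShape`, and `can_serve_natively` are hypothetical and do not reflect how `LanceNativePlanner` actually inspects the logical plan.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Direction {
    Outgoing,
    Incoming,
    Undirected,
}

/// Hypothetical summary of one Expand in the query.
#[derive(Clone, Debug)]
pub struct ExpandShape {
    pub rel_types: Vec<String>,
    pub direction: Direction,
    pub variable_length: bool,
    /// Inline `{k: v}` filters on the relationship or the target node.
    pub has_inline_property_filters: bool,
    pub relationship_variable_bound: bool,
}

/// Hypothetical summary of the whole query.
#[derive(Clone, Debug)]
pub struct QueryShape {
    pub expands: Vec<ExpandShape>,
    pub has_join: bool,
    pub has_unwind: bool,
}

/// Returns true only for the shapes listed under "Served natively";
/// everything else is left to the join path.
pub fn can_serve_natively(query: &QueryShape) -> bool {
    if query.has_join || query.has_unwind {
        return false;
    }
    // Zero expands or a multi-hop chain of expands both fall back.
    if query.expands.len() != 1 {
        return false;
    }
    let expand = &query.expands[0];
    !expand.variable_length
        && expand.rel_types.len() == 1
        && expand.direction != Direction::Undirected
        && !expand.has_inline_property_filters
        && !expand.relationship_variable_bound
}
```

A conservative predicate like this keeps the fallback path as the default: any shape not explicitly recognized is answered by the existing joins.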

Out of scope (later phases)

  • Multi-hop / VariableLengthExpand, BFS/DFS/shortest-path operators (Phase 3)
  • Persisting CSR as Lance datasets, incremental updates, stable row ids, LanceDatasetMaterializer, namespace native path (Phase 4)
  • Hybrid CSR + vector search (Phase 5)

Test plan

  • Unit tests for expand_batch, take_batch, InMemoryMaterializer, the CSR builder column generalization, and the planner native/fallback decision.
  • End-to-end parity tests (tests/test_lance_native_expand.rs) asserting LanceNative returns identical results to DataFusion for: single-hop RETURN a.name, b.name, with WHERE b.age > 30, incoming direction, and a variable-length query (fallback).
  • Full crate suite green: cargo test -p lance-graph (15 binaries, 0 failures); cargo clippy -p lance-graph --all-targets clean.

Known follow-ups (not blocking)

  • Resolve the source-id column index at execution time (by name) rather than at plan time — robust against future multi-partition / column-pruning passes (safe under DataFusion 50 today).
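
To show why name-based resolution is more robust, here is a small sketch that models a schema as a list of field names. The helper `resolve_source_id_column` and the column name `a.__rowid` are hypothetical; the real operator would consult the input batch's Arrow schema instead.

```rust
/// Looks up the source-id column by name in the schema seen at execution time.
/// Returns `None` if the column is absent, so the caller can surface an error
/// instead of silently reading the wrong column.
pub fn resolve_source_id_column(field_names: &[&str], source_id_name: &str) -> Option<usize> {
    field_names.iter().position(|&name| name == source_id_name)
}
```

If a pruning pass changed the input schema from `["a.name", "a.__rowid", "a.age"]` to `["a.__rowid", "a.name"]`, an index of 1 captured at plan time would point at `a.name`, while the name-based lookup returns 0.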

🤖 Generated with Claude Code

jja725 and others added 10 commits June 22, 2026 23:20
Design spec for issue lance-format#159 Phase 2: wire the Phase 1 CsrIndex into a
native single-hop Expand via custom DataFusion ExecutionPlan
(CsrExpandExec topology + LanceTakeExec materialization), dense-ROWID
id model, with fallback to the DataFusion join path for unsupported
shapes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

7-task TDD plan implementing the approved design: generalize CSR builder,
CsrExpandNode/Exec, LanceTakeNode/Exec + RowMaterializer, CsrExtensionPlanner/
CsrQueryPlanner, LanceNativePlanner lowering with fallback, and query.rs wiring
of the LanceNative execution strategy with end-to-end parity tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…tch_with_columns

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jja725 jja725 marked this pull request as ready for review June 23, 2026 06:49