The problem
Today lance-graph translates every Cypher query into SQL joins through DataFusion. A simple (a)-[:KNOWS]->(b)-[:LIKES]->(c) becomes a chain of self-joins that scales as O(N^k) for k hops. The LanceNativePlanner exists as a placeholder but doesn't do anything yet.
We need a CSR (Compressed Sparse Row) adjacency index so that graph traversals can do direct neighbor lookups instead of join scans.
Prior art and what we learned
We looked at three approaches and landed on a direction that combines ideas from DuckPGQ and icebug-format.
GraphAr (Apache incubator)
GraphAr showed that you can encode CSR offset tables directly in Parquet files, giving you adjacency-list access without SQL joins. The spec says: "the sorted adjList table together with the offset table can be used as CSR, supporting fast access of outgoing edges for a given vertex."
However, GraphAr is fundamentally a data interchange format -- it's designed for archiving and bulk loading into graph databases, not for online querying. It uses a deeply nested directory structure with YAML manifests, which adds overhead when you want to mount and query immediately.
icebug-format (Ladybug-Memory)
icebug-format takes a different stance that's closer to what we want. Key design choices:
- Cypher schema as catalog instead of YAML/JSON config -- a
schema.cypher file like CREATE NODE TABLE nodes(id INT64, club STRING, PRIMARY KEY(id)); CREATE REL TABLE edges(FROM nodes TO nodes); that a graph database can mount directly
- Flat file layout -- three Parquet files per edge type (
nodes_<name>.parquet, indices_<name>.parquet, indptr_<name>.parquet) instead of GraphAr's nested directories
- Two flavors:
icebug-disk (Parquet on object storage) and icebug-memory (Arrow tables, zero-copy)
- DuckDB-native CSR construction -- builds the CSR using DuckDB SQL, no custom binary tooling
- Designed like native tables in a graph database -- less generality than GraphAr, but you can mount it and query instantly
The philosophy is similar to DuckLake: keep metadata in the system catalog of the database, not in sidecar config files.
DuckPGQ (CIDR 2023, peer-reviewed)
DuckPGQ validated that the relational engine + CSR acceleration pattern works well in practice:
- Rewrites SQL/PGQ graph patterns into standard SQL joins (same as what we do with DataFusion)
- Adds vectorized CSR-based UDFs for path-finding (Multi-Source BFS with SIMD/AVX-512)
- Creates CSR on-the-fly from relational tables using ROWIDs -- no separate graph storage engine
- On LDBC LSQB benchmarks (SF10-SF100): within an order of magnitude of Umbra, consistently outperforms Neo4j (which fails 4 queries on SF100)
- BFS scales to SF3000 (9M vertices, 670M edges), while Umbra's recursive SQL OOMs on SF10
The key takeaway: you don't need a separate graph database. A relational engine with CSR acceleration handles both pattern matching (via joins) and path-finding (via native BFS on CSR).
SQL/PGQ expressiveness limits
Our research also surfaced where SQL/PGQ breaks down, which tells us what the native CSR engine must handle:
| Pattern type |
SQL/PGQ? |
Native CSR needed? |
| Fixed-length (2-3 hops) |
Standard joins |
No |
Unbounded (Kleene *, +) |
Recursive SQL or BFS |
Yes -- recursive CTEs OOM on large graphs |
| Cross-iteration edge properties (e.g. increasing timestamps along paths) |
Fundamentally inexpressible |
Yes -- only native traversal can do this |
Direction
We think the right approach for lance-graph combines ideas from DuckPGQ and icebug-format, with one important revision based on discussion in this thread: the CSR adjacency structure should be a native Lance index kind, not a sidecar file that lance-graph maintains on the side.
The original sketch (CSR stored as separate Lance datasets in a flat layout) has the consistency and ownership problems @aheev raised: who keeps the adjacency in sync with edge writes, who owns the dataset, and what happens under concurrent access. Making CSR a first-class Lance index removes all of that -- index maintenance rides on Lance's own commit + index-update path, tied to a dataset version, the same way scalar/vector/FTS indices work today. lance-graph doesn't need exclusive access; Lance owns the index lifecycle and lance-graph just reads it.
- CSR/adjacency as a native Lance secondary index -- a new index kind over an edge dataset's
(src, dst) columns, built and updated through Lance's existing index infrastructure (versioned, transactional, lazily loaded at query time)
- Native traversal operators in
LanceNativePlanner that use CSR for Expand and VariableLengthExpand instead of SQL joins
- Cypher-native schema rather than YAML/JSON config (we already parse Cypher, so this is natural)
- In-memory CSR from Arrow tables for the zero-ETL case (same idea as icebug-memory), sharing the same neighbor-lookup API as the persistent index
Storage model
Edges stay a normal Lance dataset (knows.lance) with whatever properties the user defines -- no required row ordering. The adjacency lives as a secondary index on that dataset:
- Offset table --
{vertex_id, offset, degree}, one entry per vertex plus a sentinel
- Neighbor list --
{dst_id} ordered by source vertex
For vertex v, its neighbors are neighbors[offsets[v]..offsets[v+1]]. Direction is a build-time property of the index, not the table: outgoing (CSR) keys on src, incoming (CSC) is the same construction with the columns swapped (key on dst). The two directions are independent index instances -- the reverse adjacency is materialized by re-sorting on the destination column, since reverse edges are not contiguous in the forward layout.
At query time the index is range-loaded: for a set of source vertices we read only their offset entries and the corresponding neighbor slices via take()/range scans, not the whole adjacency.
Why a native index over a sidecar
- Atomic, versioned maintenance -- adjacency updates commit with edge inserts through Lance's index path; no drift, no separate exclusive owner
- Lazy, partial loading -- only touched vertices' adjacency ranges are read, same as other Lance indices
- Random access via
take() -- O(1) row lookup by ID during traversal
- Vector search integration -- combine graph traversal with ANN search ("find 2-hop neighbors whose embeddings are similar to X")
Implementation plan
Phase 1 -- CSR index builder ✅ (done -- #160)
Phase 2 -- Native single-hop expand
Phase 3 -- Multi-hop and algorithms
Phase 4 -- Native Lance adjacency index (replaces the original sidecar-persistence plan)
Phase 5 -- Hybrid queries
References
The problem
Today lance-graph translates every Cypher query into SQL joins through DataFusion. A simple
(a)-[:KNOWS]->(b)-[:LIKES]->(c)becomes a chain of self-joins that scales as O(N^k) for k hops. TheLanceNativePlannerexists as a placeholder but doesn't do anything yet.We need a CSR (Compressed Sparse Row) adjacency index so that graph traversals can do direct neighbor lookups instead of join scans.
Prior art and what we learned
We looked at three approaches and landed on a direction that combines ideas from DuckPGQ and icebug-format.
GraphAr (Apache incubator)
GraphAr showed that you can encode CSR offset tables directly in Parquet files, giving you adjacency-list access without SQL joins. The spec says: "the sorted adjList table together with the offset table can be used as CSR, supporting fast access of outgoing edges for a given vertex."
However, GraphAr is fundamentally a data interchange format -- it's designed for archiving and bulk loading into graph databases, not for online querying. It uses a deeply nested directory structure with YAML manifests, which adds overhead when you want to mount and query immediately.
icebug-format (Ladybug-Memory)
icebug-format takes a different stance that's closer to what we want. Key design choices:
schema.cypherfile likeCREATE NODE TABLE nodes(id INT64, club STRING, PRIMARY KEY(id)); CREATE REL TABLE edges(FROM nodes TO nodes);that a graph database can mount directlynodes_<name>.parquet,indices_<name>.parquet,indptr_<name>.parquet) instead of GraphAr's nested directoriesicebug-disk(Parquet on object storage) andicebug-memory(Arrow tables, zero-copy)The philosophy is similar to DuckLake: keep metadata in the system catalog of the database, not in sidecar config files.
DuckPGQ (CIDR 2023, peer-reviewed)
DuckPGQ validated that the relational engine + CSR acceleration pattern works well in practice:
The key takeaway: you don't need a separate graph database. A relational engine with CSR acceleration handles both pattern matching (via joins) and path-finding (via native BFS on CSR).
SQL/PGQ expressiveness limits
Our research also surfaced where SQL/PGQ breaks down, which tells us what the native CSR engine must handle:
*,+)Direction
We think the right approach for lance-graph combines ideas from DuckPGQ and icebug-format, with one important revision based on discussion in this thread: the CSR adjacency structure should be a native Lance index kind, not a sidecar file that lance-graph maintains on the side.
The original sketch (CSR stored as separate Lance datasets in a flat layout) has the consistency and ownership problems @aheev raised: who keeps the adjacency in sync with edge writes, who owns the dataset, and what happens under concurrent access. Making CSR a first-class Lance index removes all of that -- index maintenance rides on Lance's own commit + index-update path, tied to a dataset version, the same way scalar/vector/FTS indices work today. lance-graph doesn't need exclusive access; Lance owns the index lifecycle and lance-graph just reads it.
(src, dst)columns, built and updated through Lance's existing index infrastructure (versioned, transactional, lazily loaded at query time)LanceNativePlannerthat use CSR forExpandandVariableLengthExpandinstead of SQL joinsStorage model
Edges stay a normal Lance dataset (
knows.lance) with whatever properties the user defines -- no required row ordering. The adjacency lives as a secondary index on that dataset:{vertex_id, offset, degree}, one entry per vertex plus a sentinel{dst_id}ordered by source vertexFor vertex
v, its neighbors areneighbors[offsets[v]..offsets[v+1]]. Direction is a build-time property of the index, not the table:outgoing(CSR) keys onsrc,incoming(CSC) is the same construction with the columns swapped (key ondst). The two directions are independent index instances -- the reverse adjacency is materialized by re-sorting on the destination column, since reverse edges are not contiguous in the forward layout.At query time the index is range-loaded: for a set of source vertices we read only their offset entries and the corresponding neighbor slices via
take()/range scans, not the whole adjacency.Why a native index over a sidecar
take()-- O(1) row lookup by ID during traversalImplementation plan
Phase 1 -- CSR index builder ✅ (done -- #160)
CsrIndex(offset/indptr array + neighbor array)Phase 2 -- Native single-hop expand
LanceNativePlannerfor single-hopExpandPhase 3 -- Multi-hop and algorithms
VariableLengthExpand(Kleene*/+)Phase 4 -- Native Lance adjacency index (replaces the original sidecar-persistence plan)
lancerepo(src, dst)columns{vertex_id, offset, degree}+ neighbor list{dst_id}ordered by sourcetake()/ range scansoutgoingandincomingas two independent index instancesPhase 5 -- Hybrid queries
References