feat: CSR adjacency index for native graph traversal (inspired by DuckPGQ + icebug-format)

## The problem

Today lance-graph translates every Cypher query into SQL joins through DataFusion. A simple `(a)-[:KNOWS]->(b)-[:LIKES]->(c)` becomes a chain of self-joins that scales as O(N^k) for k hops. The `LanceNativePlanner` exists as a placeholder but doesn't do anything yet.

We need a CSR (Compressed Sparse Row) adjacency index so that graph traversals can do direct neighbor lookups instead of join scans.

## Prior art and what we learned

We looked at three approaches and landed on a direction that combines ideas from DuckPGQ and icebug-format.

### GraphAr (Apache incubator)

GraphAr showed that you can [encode CSR offset tables directly in Parquet files](https://arxiv.org/html/2312.09577v4), giving you adjacency-list access without SQL joins. The spec says: *"the sorted adjList table together with the offset table can be used as CSR, supporting fast access of outgoing edges for a given vertex."*

However, GraphAr is fundamentally a **data interchange format** -- it's designed for archiving and bulk loading into graph databases, not for online querying. It uses a deeply nested directory structure with YAML manifests, which adds overhead when you want to mount and query immediately.

### icebug-format (Ladybug-Memory)

[icebug-format](https://github.com/Ladybug-Memory/icebug-format) takes a different stance that's closer to what we want. Key design choices:

- **Cypher schema as catalog** instead of YAML/JSON config -- a `schema.cypher` file like `CREATE NODE TABLE nodes(id INT64, club STRING, PRIMARY KEY(id)); CREATE REL TABLE edges(FROM nodes TO nodes);` that a graph database can mount directly
- **Flat file layout** -- three Parquet files per edge type (`nodes_<name>.parquet`, `indices_<name>.parquet`, `indptr_<name>.parquet`) instead of GraphAr's nested directories
- **Two flavors**: `icebug-disk` (Parquet on object storage) and `icebug-memory` (Arrow tables, zero-copy)
- **DuckDB-native CSR construction** -- builds the CSR using DuckDB SQL, no custom binary tooling
- **Designed like native tables in a graph database** -- less generality than GraphAr, but you can mount it and query instantly

The philosophy is similar to DuckLake: keep metadata in the system catalog of the database, not in sidecar config files.

### DuckPGQ (CIDR 2023, peer-reviewed)

DuckPGQ validated that the **relational engine + CSR acceleration** pattern works well in practice:

- Rewrites SQL/PGQ graph patterns into standard SQL joins (same as what we do with DataFusion)
- Adds **vectorized CSR-based UDFs** for path-finding (Multi-Source BFS with SIMD/AVX-512)
- Creates CSR **on-the-fly from relational tables** using ROWIDs -- no separate graph storage engine
- On LDBC LSQB benchmarks (SF10-SF100): within an order of magnitude of Umbra, consistently outperforms Neo4j (which fails 4 queries on SF100)
- BFS scales to SF3000 (9M vertices, 670M edges), while Umbra's recursive SQL OOMs on SF10

The key takeaway: **you don't need a separate graph database**. A relational engine with CSR acceleration handles both pattern matching (via joins) and path-finding (via native BFS on CSR).

### SQL/PGQ expressiveness limits

Our research also surfaced where SQL/PGQ breaks down, which tells us what the native CSR engine must handle:

| Pattern type | SQL/PGQ? | Native CSR needed? |
|---|---|---|
| Fixed-length (2-3 hops) | Standard joins | No |
| Unbounded (Kleene `*`, `+`) | Recursive SQL or BFS | Yes -- recursive CTEs OOM on large graphs |
| Cross-iteration edge properties (e.g. increasing timestamps along paths) | Fundamentally inexpressible | Yes -- only native traversal can do this |

## Direction

We think the right approach for lance-graph combines ideas from DuckPGQ and icebug-format, with one important revision based on discussion in this thread: **the CSR adjacency structure should be a native Lance index kind, not a sidecar file that lance-graph maintains on the side.**

The original sketch (CSR stored as separate Lance datasets in a flat layout) has the consistency and ownership problems @aheev raised: who keeps the adjacency in sync with edge writes, who owns the dataset, and what happens under concurrent access. Making CSR a first-class Lance index removes all of that -- index maintenance rides on Lance's own commit + index-update path, tied to a dataset version, the same way scalar/vector/FTS indices work today. lance-graph doesn't need exclusive access; Lance owns the index lifecycle and lance-graph just reads it.

1. **CSR/adjacency as a native Lance secondary index** -- a new index kind over an edge dataset's `(src, dst)` columns, built and updated through Lance's existing index infrastructure (versioned, transactional, lazily loaded at query time)
2. **Native traversal operators in `LanceNativePlanner`** that use CSR for `Expand` and `VariableLengthExpand` instead of SQL joins
3. **Cypher-native schema** rather than YAML/JSON config (we already parse Cypher, so this is natural)
4. **In-memory CSR from Arrow tables** for the zero-ETL case (same idea as icebug-memory), sharing the same neighbor-lookup API as the persistent index

### Storage model

Edges stay a normal Lance dataset (`knows.lance`) with whatever properties the user defines -- no required row ordering. The adjacency lives as a secondary index on that dataset:

- **Offset table** -- `{vertex_id, offset, degree}`, one entry per vertex plus a sentinel
- **Neighbor list** -- `{dst_id}` ordered by source vertex

For vertex `v`, its neighbors are `neighbors[offsets[v]..offsets[v+1]]`. Direction is a build-time property of the index, not the table: `outgoing` (CSR) keys on `src`, `incoming` (CSC) is the same construction with the columns swapped (key on `dst`). The two directions are independent index instances -- the reverse adjacency is materialized by re-sorting on the destination column, since reverse edges are not contiguous in the forward layout.

At query time the index is range-loaded: for a set of source vertices we read only their offset entries and the corresponding neighbor slices via `take()`/range scans, not the whole adjacency.

### Why a native index over a sidecar

- **Atomic, versioned maintenance** -- adjacency updates commit with edge inserts through Lance's index path; no drift, no separate exclusive owner
- **Lazy, partial loading** -- only touched vertices' adjacency ranges are read, same as other Lance indices
- **Random access via `take()`** -- O(1) row lookup by ID during traversal
- **Vector search integration** -- combine graph traversal with ANN search ("find 2-hop neighbors whose embeddings are similar to X")

## Implementation plan

### Phase 1 -- CSR index builder ✅ (done -- #160)
- [x] In-memory `CsrIndex` (offset/indptr array + neighbor array)
- [x] Neighbor lookup + out-degree
- [x] BFS
- [x] Shortest path
- [x] Arrow serialization (offset table + neighbor list)
- [x] Builder sorts edges by source internally (no input ordering required)

### Phase 2 -- Native single-hop expand
- [ ] Wire in-memory CSR into `LanceNativePlanner` for single-hop `Expand`
- [ ] Outgoing (CSR) direction
- [ ] Incoming (CSC) direction via reversed-column build
- [ ] DataFusion fallback for undirected expands
- [ ] End-to-end parity tests vs. the join planner

### Phase 3 -- Multi-hop and algorithms
- [ ] Iterative CSR traversal for fixed-length multi-hop
- [ ] `VariableLengthExpand` (Kleene `*` / `+`)
- [ ] BFS operator
- [ ] DFS operator
- [ ] Shortest path operator
- [ ] Cross-iteration edge-property predicates (e.g. increasing timestamps along a path)

### Phase 4 -- Native Lance adjacency index (replaces the original sidecar-persistence plan)
- [ ] **4a -- Design + upstream discussion on the `lance` repo**
  - [ ] Open discussion proposing a new adjacency/CSR secondary index kind ([link TBD])
  - [ ] Align on the index-extension API (how to register a new index kind)
  - [ ] Agree on offset-table + neighbor-list on-disk layout
- [ ] **4b -- Implement the index in Lance**
  - [ ] New index kind over an edge dataset's `(src, dst)` columns
  - [ ] Offset table `{vertex_id, offset, degree}` + neighbor list `{dst_id}` ordered by source
  - [ ] Build/maintain through Lance's commit + index-update path (versioned, transactional)
  - [ ] Incremental update on edge inserts (atomic with edge writes)
  - [ ] Lazy, range-loaded reads via `take()` / range scans
- [ ] **4c -- Consume from lance-graph**
  - [ ] Read path through the native index
  - [ ] `outgoing` and `incoming` as two independent index instances

### Phase 5 -- Hybrid queries
- [ ] Combine CSR traversal with vector (ANN) search
- [ ] Combine CSR traversal with property filters
- [ ] Example: "find 2-hop neighbors whose embeddings are similar to X"

## References

- [GraphAr: An Open Source Library for Graph Data Storage and Retrieval](https://arxiv.org/html/2312.09577v4) (Apache incubator)
- [icebug-format: A proposal for graph standardization](https://github.com/Ladybug-Memory/icebug-format) (Ladybug-Memory)
- [Graph Archiving with Apache GraphAr](https://adsharma.github.io/graph-archiving/) (blog post comparing GraphAr vs icebug)
- [DuckPGQ: Efficient Property Graph Queries in an Analytical RDBMS](https://www.cidrdb.org/cidr2023/papers/p66-wolde.pdf) (CIDR 2023)
- [BACH: A Unified Graph Store Supporting Efficient Online Analytical Processing](https://www.vldb.org/pvldb/vol18/p1509-miao.pdf) (PVLDB 2025)
- [SQL/PGQ: Property Graph Queries in SQL](https://arxiv.org/abs/2112.06217) (SIGMOD 2022)
- [GQL Formal Semantics](https://arxiv.org/pdf/2409.01102) (VLDB 2025)
- [SQL/PGQ vs SQL Performance Analysis](https://arxiv.org/html/2505.07595v1) (2025)
- [TigerGraph: A Native MPP Graph Database](https://arxiv.org/pdf/1901.08248)
- [PostgreSQL 19 SQL/PGQ Docs](https://www.postgresql.org/docs/19/queries-graph.html)



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: CSR adjacency index for native graph traversal (inspired by DuckPGQ + icebug-format) #159

The problem

Prior art and what we learned

GraphAr (Apache incubator)

icebug-format (Ladybug-Memory)

DuckPGQ (CIDR 2023, peer-reviewed)

SQL/PGQ expressiveness limits

Direction

Storage model

Why a native index over a sidecar

Implementation plan

Phase 1 -- CSR index builder ✅ (done -- #160)

Phase 2 -- Native single-hop expand

Phase 3 -- Multi-hop and algorithms

Phase 4 -- Native Lance adjacency index (replaces the original sidecar-persistence plan)

Phase 5 -- Hybrid queries

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Pattern type	SQL/PGQ?	Native CSR needed?
Fixed-length (2-3 hops)	Standard joins	No
Unbounded (Kleene `*`, `+`)	Recursive SQL or BFS	Yes -- recursive CTEs OOM on large graphs
Cross-iteration edge properties (e.g. increasing timestamps along paths)	Fundamentally inexpressible	Yes -- only native traversal can do this

feat: CSR adjacency index for native graph traversal (inspired by DuckPGQ + icebug-format) #159

Description

The problem

Prior art and what we learned

GraphAr (Apache incubator)

icebug-format (Ladybug-Memory)

DuckPGQ (CIDR 2023, peer-reviewed)

SQL/PGQ expressiveness limits

Direction

Storage model

Why a native index over a sidecar

Implementation plan

Phase 1 -- CSR index builder ✅ (done -- #160)

Phase 2 -- Native single-hop expand

Phase 3 -- Multi-hop and algorithms

Phase 4 -- Native Lance adjacency index (replaces the original sidecar-persistence plan)

Phase 5 -- Hybrid queries

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions