Skip to content

feat: CSR adjacency index for native graph traversal (inspired by DuckPGQ + icebug-format) #159

Description

@jja725

The problem

Today lance-graph translates every Cypher query into SQL joins through DataFusion. A simple (a)-[:KNOWS]->(b)-[:LIKES]->(c) becomes a chain of self-joins that scales as O(N^k) for k hops. The LanceNativePlanner exists as a placeholder but doesn't do anything yet.

We need a CSR (Compressed Sparse Row) adjacency index so that graph traversals can do direct neighbor lookups instead of join scans.

Prior art and what we learned

We looked at three approaches and landed on a direction that combines ideas from DuckPGQ and icebug-format.

GraphAr (Apache incubator)

GraphAr showed that you can encode CSR offset tables directly in Parquet files, giving you adjacency-list access without SQL joins. The spec says: "the sorted adjList table together with the offset table can be used as CSR, supporting fast access of outgoing edges for a given vertex."

However, GraphAr is fundamentally a data interchange format -- it's designed for archiving and bulk loading into graph databases, not for online querying. It uses a deeply nested directory structure with YAML manifests, which adds overhead when you want to mount and query immediately.

icebug-format (Ladybug-Memory)

icebug-format takes a different stance that's closer to what we want. Key design choices:

  • Cypher schema as catalog instead of YAML/JSON config -- a schema.cypher file like CREATE NODE TABLE nodes(id INT64, club STRING, PRIMARY KEY(id)); CREATE REL TABLE edges(FROM nodes TO nodes); that a graph database can mount directly
  • Flat file layout -- three Parquet files per edge type (nodes_<name>.parquet, indices_<name>.parquet, indptr_<name>.parquet) instead of GraphAr's nested directories
  • Two flavors: icebug-disk (Parquet on object storage) and icebug-memory (Arrow tables, zero-copy)
  • DuckDB-native CSR construction -- builds the CSR using DuckDB SQL, no custom binary tooling
  • Designed like native tables in a graph database -- less generality than GraphAr, but you can mount it and query instantly

The philosophy is similar to DuckLake: keep metadata in the system catalog of the database, not in sidecar config files.

DuckPGQ (CIDR 2023, peer-reviewed)

DuckPGQ validated that the relational engine + CSR acceleration pattern works well in practice:

  • Rewrites SQL/PGQ graph patterns into standard SQL joins (same as what we do with DataFusion)
  • Adds vectorized CSR-based UDFs for path-finding (Multi-Source BFS with SIMD/AVX-512)
  • Creates CSR on-the-fly from relational tables using ROWIDs -- no separate graph storage engine
  • On LDBC LSQB benchmarks (SF10-SF100): within an order of magnitude of Umbra, consistently outperforms Neo4j (which fails 4 queries on SF100)
  • BFS scales to SF3000 (9M vertices, 670M edges), while Umbra's recursive SQL OOMs on SF10

The key takeaway: you don't need a separate graph database. A relational engine with CSR acceleration handles both pattern matching (via joins) and path-finding (via native BFS on CSR).

SQL/PGQ expressiveness limits

Our research also surfaced where SQL/PGQ breaks down, which tells us what the native CSR engine must handle:

Pattern type SQL/PGQ? Native CSR needed?
Fixed-length (2-3 hops) Standard joins No
Unbounded (Kleene *, +) Recursive SQL or BFS Yes -- recursive CTEs OOM on large graphs
Cross-iteration edge properties (e.g. increasing timestamps along paths) Fundamentally inexpressible Yes -- only native traversal can do this

Direction

We think the right approach for lance-graph combines ideas from DuckPGQ and icebug-format, with one important revision based on discussion in this thread: the CSR adjacency structure should be a native Lance index kind, not a sidecar file that lance-graph maintains on the side.

The original sketch (CSR stored as separate Lance datasets in a flat layout) has the consistency and ownership problems @aheev raised: who keeps the adjacency in sync with edge writes, who owns the dataset, and what happens under concurrent access. Making CSR a first-class Lance index removes all of that -- index maintenance rides on Lance's own commit + index-update path, tied to a dataset version, the same way scalar/vector/FTS indices work today. lance-graph doesn't need exclusive access; Lance owns the index lifecycle and lance-graph just reads it.

  1. CSR/adjacency as a native Lance secondary index -- a new index kind over an edge dataset's (src, dst) columns, built and updated through Lance's existing index infrastructure (versioned, transactional, lazily loaded at query time)
  2. Native traversal operators in LanceNativePlanner that use CSR for Expand and VariableLengthExpand instead of SQL joins
  3. Cypher-native schema rather than YAML/JSON config (we already parse Cypher, so this is natural)
  4. In-memory CSR from Arrow tables for the zero-ETL case (same idea as icebug-memory), sharing the same neighbor-lookup API as the persistent index

Storage model

Edges stay a normal Lance dataset (knows.lance) with whatever properties the user defines -- no required row ordering. The adjacency lives as a secondary index on that dataset:

  • Offset table -- {vertex_id, offset, degree}, one entry per vertex plus a sentinel
  • Neighbor list -- {dst_id} ordered by source vertex

For vertex v, its neighbors are neighbors[offsets[v]..offsets[v+1]]. Direction is a build-time property of the index, not the table: outgoing (CSR) keys on src, incoming (CSC) is the same construction with the columns swapped (key on dst). The two directions are independent index instances -- the reverse adjacency is materialized by re-sorting on the destination column, since reverse edges are not contiguous in the forward layout.

At query time the index is range-loaded: for a set of source vertices we read only their offset entries and the corresponding neighbor slices via take()/range scans, not the whole adjacency.

Why a native index over a sidecar

  • Atomic, versioned maintenance -- adjacency updates commit with edge inserts through Lance's index path; no drift, no separate exclusive owner
  • Lazy, partial loading -- only touched vertices' adjacency ranges are read, same as other Lance indices
  • Random access via take() -- O(1) row lookup by ID during traversal
  • Vector search integration -- combine graph traversal with ANN search ("find 2-hop neighbors whose embeddings are similar to X")

Implementation plan

Phase 1 -- CSR index builder ✅ (done -- #160)

  • In-memory CsrIndex (offset/indptr array + neighbor array)
  • Neighbor lookup + out-degree
  • BFS
  • Shortest path
  • Arrow serialization (offset table + neighbor list)
  • Builder sorts edges by source internally (no input ordering required)

Phase 2 -- Native single-hop expand

  • Wire in-memory CSR into LanceNativePlanner for single-hop Expand
  • Outgoing (CSR) direction
  • Incoming (CSC) direction via reversed-column build
  • DataFusion fallback for undirected expands
  • End-to-end parity tests vs. the join planner

Phase 3 -- Multi-hop and algorithms

  • Iterative CSR traversal for fixed-length multi-hop
  • VariableLengthExpand (Kleene * / +)
  • BFS operator
  • DFS operator
  • Shortest path operator
  • Cross-iteration edge-property predicates (e.g. increasing timestamps along a path)

Phase 4 -- Native Lance adjacency index (replaces the original sidecar-persistence plan)

  • 4a -- Design + upstream discussion on the lance repo
    • Open discussion proposing a new adjacency/CSR secondary index kind ([link TBD])
    • Align on the index-extension API (how to register a new index kind)
    • Agree on offset-table + neighbor-list on-disk layout
  • 4b -- Implement the index in Lance
    • New index kind over an edge dataset's (src, dst) columns
    • Offset table {vertex_id, offset, degree} + neighbor list {dst_id} ordered by source
    • Build/maintain through Lance's commit + index-update path (versioned, transactional)
    • Incremental update on edge inserts (atomic with edge writes)
    • Lazy, range-loaded reads via take() / range scans
  • 4c -- Consume from lance-graph
    • Read path through the native index
    • outgoing and incoming as two independent index instances

Phase 5 -- Hybrid queries

  • Combine CSR traversal with vector (ANN) search
  • Combine CSR traversal with property filters
  • Example: "find 2-hop neighbors whose embeddings are similar to X"

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions