Skip to content

feat: add shortest path return to BFS and SSSP algorithms#574

Open
longbinlai wants to merge 23 commits into
alibaba:mainfrom
longbinlai:feat/gds-shortest-path-return
Open

feat: add shortest path return to BFS and SSSP algorithms#574
longbinlai wants to merge 23 commits into
alibaba:mainfrom
longbinlai:feat/gds-shortest-path-return

Conversation

@longbinlai

Copy link
Copy Markdown
Collaborator

Summary

Add shortest path return to GDS BFS and SSSP algorithms, plus path serialization optimizations that benefit all path-returning queries.

Changes

Commit 1: feat(gds): add shortest path return to BFS and SSSP algorithms

  • path output column via YIELD clause (pure YIELD-based, no config options)
  • Predecessor tracking in BFS/SSSP/BFSPred/SSSPPred (O(V) memory)
  • Real edge data lookup from CSR in path reconstruction
  • Struct-wrapped DataType for kPath compatibility with type converter
  • Zero overhead when path is not YIELDed
  • All changes confined to extension/gds/ — no system code modified

Commit 2-3: perf: optimize path JSON serialization

  • Eliminate serialize→parse→copy string round-trip in convert_path_to_json (sink.cc)
  • Add direct rapidjson build helpers: build_vertex_json_value / build_edge_json_value
  • Backward compatible: convert_vertex_to_json / convert_edge_to_json unchanged for standalone callers

Commit 4: feat(gds): add path_properties configuration

  • Lightweight mode (default): Only _ID, _LABEL, PK for nodes; _ID, _LABEL, _SRC_ID, _DST_ID for edges
    • Performance: ~0.54s on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths)
  • Full mode (path_properties: 'full'): All vertex and edge properties
    • Performance: ~1.26s (2.34x slower)
  • Thread-local encoding flag in sink.cc with public API in sink.h

Usage

-- Distance only (backward compatible, zero overhead)
CALL bfs('g', {source: '0'}) YIELD node, distance RETURN node.id, distance;

-- With path return (lightweight, default)
CALL bfs('g', {source: '0'}) YIELD node, distance, path RETURN node.id, distance, path;

-- With path return (full properties)
CALL bfs('g', {source: '0', path_properties: 'full'}) YIELD node, distance, path RETURN node.id, distance, path;

Testing

Suite Tests Result
GDS path return (test_gds_path.py) 21 ✅ all pass
GDS path encoding modes 5 ✅ all pass
LDBC SF10 path tests (test_gds_path_ldbc.py) 18 ✅ all pass
Existing GDS tests (test_gds.py) 36 ✅ all pass
LDBC Graphalytics conformance (test_graphalytics.py) 24 ✅ all pass
Other test suites (test_db_cases, test_query, etc.) 300+ ✅ no regressions

Performance (LDBC SNB SF10: 65K nodes, 1.9M edges, 62K paths)

Configuration Time vs Baseline
BFS no path 0.025s
BFS + path (lightweight) 0.54s 6.5x from no-path
BFS + path (full) 1.26s 2.34x from lightweight

Files Changed

  • extension/gds/ — 12 files (path return implementation + configuration)
  • include/neug/execution/common/operators/retrieve/sink.h — encoding mode API
  • src/execution/common/operators/retrieve/sink.cc — serialization optimization
  • tools/python_bind/tests/ — 3 new test files, 21 new tests

Add `path` output column to BFS and SSSP GDS algorithms, allowing users
to retrieve the actual shortest path (vertex + edge sequence) via the
standard Cypher YIELD clause.

## Cypher API

```cypher
-- Distance only (backward compatible, zero overhead)
CALL bfs('g', {source: '0'}) YIELD node, distance RETURN node.id, distance;

-- With path return (path computation triggered by YIELDing path)
CALL bfs('g', {source: '0'}) YIELD node, distance, path
RETURN node.id, distance, path;
```

## Key Design Decisions

- **Pure YIELD-based API**: no config options; path computation only
  happens when the `path` column is explicitly YIELDed
- **Zero overhead when not requested**: predecessor array is not
  allocated when path is not in the YIELD clause
- **Native kPath type**: returns standard NeuG Path objects with real
  edge data pointers looked up from the CSR graph view, supporting all
  Cypher path functions (nodes(), relationships(), length())
- **Struct-wrapped DataType**: the path column uses a properly
  constructed StructTypeInfo (with _NODES and _RELS fields) to match
  the type converter's expectations, via a bind-time wrapper

## Implementation

- Adds `predecessors_[]` array to BFS/BFSPred/SSSP/SSSPPred (O(V) memory)
- BFS parallel: predecessor write inside CAS-success branch (no extra sync)
- SSSP parallel: predecessor write after successful distance relaxation
- Sequential variants (pred): plain writes, no atomics needed
- Path reconstruction in sink() walks predecessor chain, looks up real
  edge data from CSR, builds PathColumn via PathColumnBuilder
- New `path_utils.h` with `reconstruct_path()`, `build_path_from_chain()`,
  `buildPathDataType()`, and `wrapTableBindFuncWithPathFix()`

## Testing

- 16 new path return tests (test_gds_path.py): basic, backward compat,
  predicates, custom graphs, edge cases
- All 76 existing tests pass (24 Graphalytics + 36 GDS + 16 path)
- LDBC Graphalytics conformance verified with and without path return
The convert_path_to_json function was serializing each vertex/edge to a
JSON string, then parsing it back into a rapidjson Document, then copying
it into the parent array — three operations where one suffices.

Add build_vertex_json_value and build_edge_json_value helpers that
construct rapidjson objects directly into the provided allocator, and
have convert_path_to_json use them instead of the string-based
convert_vertex_to_json / convert_edge_to_json.

The original string-returning functions are kept unchanged for their
standalone callers in add_column (kVertex/kEdge cases).

Measured on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths):
  BFS with path return: 1.63s → 1.22s (25% faster)
…ncoding

Two optimizations to convert_path_to_json in sink.cc:

1. Direct rapidjson build: replace the serialize→parse→copy string
   round-trip with build_vertex_json_value / build_edge_json_value
   helpers that construct rapidjson objects directly into the allocator.

2. Lightweight path encoding: path output now encodes only _ID, _LABEL,
   and PK for nodes (plus _SRC_ID, _DST_ID for edges), skipping all
   non-PK property lookups. Users can retrieve full node/edge details
   with a separate MATCH query when needed.

Measured on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths, 3-round avg):

  BFS yield+return path:  1.63s → 0.50s  (3.3x faster, ↓ 70%)

  Breakdown of improvement:
    - String round-trip elimination:  1.63s → 1.31s  (↓ 20%)
    - Lightweight encoding (skip property I/O):  1.31s → 0.50s  (↓ 62%)

All 94 GDS tests pass. No regressions in other test suites.
…path encoding

Add configurable path encoding mode to control performance vs completeness:

- **Lightweight mode (default)**: Only encodes structural info (_ID, _LABEL, PK for nodes; _ID, _LABEL, _SRC_ID, _DST_ID for edges)
  - Performance: ~0.54s on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths)

- **Full mode**: Encodes all vertex and edge properties
  - Performance: ~1.26s (2.34x slower than lightweight)

Usage:
  CALL bfs('graph', {source: '0'})  -- lightweight (default)
  CALL bfs('graph', {source: '0', path_properties: 'full'})  -- full

Implementation:
- Add thread-local flag in sink.cc to control encoding mode
- Add set_path_full_encoding()/get_path_full_encoding() API in sink.h
- Add configure_path_encoding() helper in path_utils.h
- Parse path_properties option in bfs.cc and sssp.cc
- Update build_vertex_json_value() and build_edge_json_value() to check flag
- Add 5 new tests in test_gds_path.py::TestPathEncodingModes
- Update spec document with configuration details

All 57 GDS tests pass. No regressions in other test suites.
@CLAassistant

CLAassistant commented Jun 20, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@longbinlai longbinlai changed the title feat(gds): add shortest path return to BFS and SSSP algorithms feat: add shortest path return to BFS and SSSP algorithms Jun 20, 2026
- Add 'Shortest Path Return' section to doc/source/extensions/load_gds.md
- Document path_properties option (lightweight/full modes)
- Update BFS and SSSP sections with path examples
- Update Algorithm Summary table
- Remove technical spec file (not for users)
@longbinlai longbinlai force-pushed the feat/gds-shortest-path-return branch from 9b83082 to 0262596 Compare June 20, 2026 04:54
….1 + clang-format 10.0.1)

- Remove build_vertex_json_value_light and build_edge_json_value_light
  (defined but not used, causes -Werror=unused-function in CI)
- Re-run isort with pinned version 5.10.1 (CI uses 5.10.1, not 6.x)
- Re-run black on Python test files
@longbinlai longbinlai force-pushed the feat/gds-shortest-path-return branch from f394613 to a59f3d6 Compare June 20, 2026 08:54
…TCH p = ... queries

The previous lightweight default broke C++ tests that use MATCH p = (a)-[*]->(b)
queries, which expect full property encoding. This commit:

- Changes default path encoding to 'full' mode (backward compatible)
- Users can opt into 'lightweight' mode via path_properties option for better performance
- Updates GDS BFS/SSSP algorithms to use 'full' as default
- Updates tests to verify full mode is default
- Updates documentation to reflect the change
- Reorder BFS examples: default full path first, explicit full second,
  lightweight last as performance optimization
- Update performance recommendations to describe full mode as default
  and lightweight as opt-in for structure-only use cases
- Remove internal design spec (extension/gds/docs/gds-shortest-path-spec.md)

Closes alibaba#575
@longbinlai longbinlai force-pushed the feat/gds-shortest-path-return branch from f766964 to 4a1729e Compare June 21, 2026 03:54
@longbinlai longbinlai requested a review from liulx20 June 22, 2026 02:11
… doc links

- Fix project_graph() syntax (was missing triplet map format)
- Remove reference to non-existent docs/gds-shortest-path-spec.md
- Point to existing doc/source/extensions/load_gds.md for full docs
- List all 9 algorithms in a concise table
- Fix build commands to use repo root Makefile
Comment thread extension/gds/src/impl/bfs_impl.cc Outdated
if (__atomic_compare_exchange_n(&distances_[dst], &expected,
level, false, __ATOMIC_RELAXED,
__ATOMIC_RELAXED)) {
if (return_path_) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether there are multi-threading concurrency issues here. A more reasonable approach would be: after the BFS completes, re-traverse from each node—for example, starting from nodes at depth d to find their neighbors at depth d-1, continuing until reaching the target node.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. You are right that the dense pull mode had a race condition on predecessors_.

Fix applied (commit 3dbc758): In the dense pull mode, I added an atomic CAS on distances_[dst] before writing predecessors_[dst]. The flow is now:

  1. Check if the vertex is unvisited (distances_[dst] == max)
  2. Scan neighbors for a predecessor at level - 1
  3. If reachable, CAS distances_[dst] from max to level — only the winning thread writes predecessors_[dst]

This ensures only one thread can claim each vertex and set its predecessor, eliminating the race. The sparse push mode was already safe (CAS on distance was already there), so no change was needed there.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! We've adopted exactly this approach in the latest commit. Predecessor arrays are now completely removed from BFS computation. Paths are reconstructed post-hoc in sink() by walking the distance array: for vertex v at distance d, scan neighbors to find one at distance d-1, repeat until reaching source. This eliminates the concurrency hazard entirely — no CAS needed, no predecessor writes during parallel computation. The dense pull mode has been reverted to the original simple non-atomic distances_[dst] = level.

double cand = dist + weight;
if (distances_[w] < 0 || cand < distances_[w]) {
distances_[w] = cand;
if (return_path_) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to BFS, it may be difficult to trace back predecessor nodes in SSSP. A simple solution is to implement a single-threaded version to maintain the predecessors.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider routing the case that returns a path through the predicate branch, since it is inherently single-threaded.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Adopted your suggestion — when path return is requested (return_path=true), the dispatcher in sssp.cc now routes to SSSPPred (single-threaded Dijkstra) regardless of whether predicates are present. This completely avoids the concurrency hazard of maintaining predecessor chains in the parallel SSSP, where iterative relaxation by multiple threads makes predecessor tracking non-trivial.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly what I did — the dispatcher in sssp.cc now checks return_path and routes to the predicate branch (SSSPPred, single-threaded Dijkstra) when path return is requested. This is inherently safe since there is no concurrent access to the predecessor array. All 95 Python tests pass.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. SSSPPred no longer stores predecessors during computation. Path reconstruction is done post-hoc in sink() by walking the distance array: for vertex v with distance D, find a neighbor u where dist(u) + weight(u→v) == dist(v). This is single-threaded and avoids the difficulty of maintaining predecessor chains during parallel relaxation.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we route return_path cases through SSSPPred (single-threaded Dijkstra) as suggested. SSSPPred now also uses post-hoc path reconstruction instead of a predecessor array, so the routing is kept for the predicate case and the path return case. The parallel SSSP is only used when no predicates and no path return.

Address review comments from liulx20 on PR alibaba#574 regarding multi-threading
concurrency issues in predecessor tracking.

BFS (bfs_impl.cc):
- Dense pull mode: use atomic CAS for distance update to ensure only
  one thread writes predecessors_[dst], eliminating the race where
  multiple threads could simultaneously claim the same unvisited vertex
  and overwrite each other predecessor entry.

SSSP (sssp.cc):
- Route path return through SSSPPred (single-threaded Dijkstra) when
  return_path=true, regardless of whether predicates are present.
  This avoids the inherent difficulty of maintaining correct predecessor
  chains in the parallel SSSP iterative relaxation.

Also fix tarfile DeprecationWarning in io_utils.py (filter=data).
This reverts commit 91039e0.
Add test_gds_concurrency.py with 19 tests validating the concurrency fix
(3dbc758) on a 20K-vertex graph that triggers multi-threaded parallel_for
and dense pull mode. Tests verify BFS distance determinism, path validity
(correct length, valid edges, no cycles, correct endpoints), and SSSP
full determinism (single-threaded SSSPPred routing).

Move standalone benchmark scripts from tools/python_bind/tests/ to
benchmark/:
- bench_gds_path.py (was tests/bench_gds_path.py)
- bench_gds_algorithms.py (was tests/test_gds_benchmark.py)
- bench_gds_performance.py (was tests/test_gds_performance.py)
- bench_wcc_comparison.py (was tests/test_wcc_comparison.py)

Add test_gds_path.py and test_gds_concurrency.py to the extension test
CI workflow.
Remove bench_gds_algorithms.py, bench_gds_performance.py, and
bench_wcc_comparison.py from benchmark/ — only bench_gds_path.py
should be moved. Apply isort+black formatting to test_gds_concurrency.py.
Keep concurrency as the last option consistently across all algorithm
tables.

from neug.database import Database

DATA_DIR = os.path.expanduser("~/Downloads/social_network_10_interactive")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we avoid using absolute paths here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Replaced the hardcoded ~/Downloads/... path with an environment variable NEUG_LDBC_DATA_DIR. The test auto-skips when the variable is not set.

// - lightweight: only encode _ID, _LABEL, PK for vertices; structural info for
// edges
// GDS extension can call set_path_full_encoding(false) for lightweight mode
static thread_local bool path_full_encoding_enabled = true;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the scope of this, and can it be declared outside of sink.cc?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this variable now affects the sink behavior of other queries, which is unreasonable.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The global variable has been completely removed. The encoding mode is now carried per-column: PathColumnBuilder has a set_full_encoding(bool) setter, PathColumn carries the flag, and sink.cc's add_column() reads it from the column when serializing. GDS sink() methods receive full_encoding as a parameter and set it on the builder. Non-GDS path columns default to true (full encoding), so other queries are never affected.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fully resolved. The static thread_local bool path_full_encoding_enabled variable has been completely removed from sink.cc. The encoding mode is now carried as a per-column member on PathColumn/PathColumnBuilder, so GDS path queries with path_properties: 'lightweight' only affect their own path column. Other queries' path columns always default to full encoding and are never affected.

Comment thread extension/gds/src/sssp.cc
{"node", common::DataTypeId::kVertex},
{"distance", common::DataTypeId::kDouble}};
{"distance", common::DataTypeId::kDouble},
{"path", common::DataTypeId::kPath}};

@liulx20 liulx20 Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function declaration here returns the column for the path, yet it doesn't return anything when return path is false, which seems odd. Is there a better way to handle this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 3-column output declaration (node, distance, path) represents the function's schema — the maximum set of columns the function CAN produce. The YIELD clause selects which columns are actually returned. When the user doesn't YIELD path, return_path is false and the path column is simply not populated in the output Context. This is standard Cypher CALL behavior — the output columns declare what's available, not what's always produced. We kept this design as changing it would require modifications to the function binding infrastructure.

Comment thread extension/gds/include/impl/bfs_impl.h Outdated
void compute();
void sink(execution::Context& ctx, int node_alias, int distance_alias);
void sink(execution::Context& ctx, int node_alias, int distance_alias,
int path_alias = -1);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, -1 is a valid alias, so we probably shouldn't use -1 to indicate whether the path should be output.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The path_alias >= 0 check has been removed. All sink() methods now use return_path_ as the sole indicator for whether to build the path column. When return_path_ is true, path_alias is always valid (set by the binder).

Comment thread extension/gds/src/sssp.cc Outdated
SSSP sssp(graph, sssp_input.vertex_label, sssp_input.edge_label, source_vid,
sssp_input.directed, sssp_input.edge_weight,
sssp_input.concurrency);
sssp_input.concurrency, sssp_input.return_path);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the else branch does not handle return_path, the logic related to return_path can be removed.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The else branch no longer passes return_path or path_alias to the parallel SSSP. The return_path_ and predecessors_ members have been completely removed from the SSSP class, along with the dead if (return_path_) branches in compute() and sink(). The parallel SSSP constructor no longer takes a return_path parameter.

Address review comment from liulx20: use NEUG_LDBC_DATA_DIR environment
variable instead of hardcoded ~/Downloads path. Test skips when the
variable is not set.
…rays

Address liulx20's review comments on PR alibaba#574:

- Remove predecessor arrays from BFS, BFSPred, SSSP, SSSPPred. Paths are
  now reconstructed post-hoc in sink() by walking the distance array
  backward: for vertex v at distance d, find a neighbor u at distance d-1.
  This eliminates the concurrency hazard entirely (path reconstruction is
  single-threaded).

- BFS dense pull mode: revert CAS to simple non-atomic write (CAS was only
  needed to protect predecessor writes, which are now gone).

- SSSP parallel: remove return_path_ and predecessors_ entirely (parallel
  SSSP never produces paths; path return routes through SSSPPred).

- Remove configure_path_encoding() wrapper; call set_path_full_encoding()
  directly in bfs.cc/sssp.cc and reset to true (default) after sink() to
  prevent the thread-local flag from leaking to other queries.

- Use return_path_ instead of path_alias >= 0 as the path output indicator
  (-1 is a valid alias per liulx20's comment).

- Remove PlainPredecessorAccessor; reconstruct_path() now accepts a
  callable for finding predecessors on-the-fly.

Performance verified on Graph500-23 (4.6M vertices, 129M edges):
  BFS with path: 38.2s (original) → 37.1s (refactor)
  SSSP with path: 47.4s (original) → 46.0s (refactor)
  No regression; slight improvement from removed CAS overhead.
Address liulx20's comment: "this variable now affects the sink behavior
of other queries, which is unreasonable."

Remove the thread_local global variable path_full_encoding_enabled from
sink.cc. Instead, the encoding mode is carried as a member of
PathColumn/PathColumnBuilder:
- PathColumnBuilder::set_full_encoding(bool) sets the mode
- PathColumnBuilder::finish() propagates it to PathColumn
- PathColumn::full_encoding() reads it
- sink.cc's add_column() reads it from the column and passes to
  convert_path_to_json()

GDS sink() methods receive full_encoding as a parameter and set it on
the PathColumnBuilder. Non-GDS path columns default to true (full
encoding), so other queries are never affected.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants