feat: add shortest path return to BFS and SSSP algorithms#574
feat: add shortest path return to BFS and SSSP algorithms#574longbinlai wants to merge 23 commits into
Conversation
Add `path` output column to BFS and SSSP GDS algorithms, allowing users
to retrieve the actual shortest path (vertex + edge sequence) via the
standard Cypher YIELD clause.
## Cypher API
```cypher
-- Distance only (backward compatible, zero overhead)
CALL bfs('g', {source: '0'}) YIELD node, distance RETURN node.id, distance;
-- With path return (path computation triggered by YIELDing path)
CALL bfs('g', {source: '0'}) YIELD node, distance, path
RETURN node.id, distance, path;
```
## Key Design Decisions
- **Pure YIELD-based API**: no config options; path computation only
happens when the `path` column is explicitly YIELDed
- **Zero overhead when not requested**: predecessor array is not
allocated when path is not in the YIELD clause
- **Native kPath type**: returns standard NeuG Path objects with real
edge data pointers looked up from the CSR graph view, supporting all
Cypher path functions (nodes(), relationships(), length())
- **Struct-wrapped DataType**: the path column uses a properly
constructed StructTypeInfo (with _NODES and _RELS fields) to match
the type converter's expectations, via a bind-time wrapper
## Implementation
- Adds `predecessors_[]` array to BFS/BFSPred/SSSP/SSSPPred (O(V) memory)
- BFS parallel: predecessor write inside CAS-success branch (no extra sync)
- SSSP parallel: predecessor write after successful distance relaxation
- Sequential variants (pred): plain writes, no atomics needed
- Path reconstruction in sink() walks predecessor chain, looks up real
edge data from CSR, builds PathColumn via PathColumnBuilder
- New `path_utils.h` with `reconstruct_path()`, `build_path_from_chain()`,
`buildPathDataType()`, and `wrapTableBindFuncWithPathFix()`
## Testing
- 16 new path return tests (test_gds_path.py): basic, backward compat,
predicates, custom graphs, edge cases
- All 76 existing tests pass (24 Graphalytics + 36 GDS + 16 path)
- LDBC Graphalytics conformance verified with and without path return
The convert_path_to_json function was serializing each vertex/edge to a JSON string, then parsing it back into a rapidjson Document, then copying it into the parent array — three operations where one suffices. Add build_vertex_json_value and build_edge_json_value helpers that construct rapidjson objects directly into the provided allocator, and have convert_path_to_json use them instead of the string-based convert_vertex_to_json / convert_edge_to_json. The original string-returning functions are kept unchanged for their standalone callers in add_column (kVertex/kEdge cases). Measured on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths): BFS with path return: 1.63s → 1.22s (25% faster)
…ncoding
Two optimizations to convert_path_to_json in sink.cc:
1. Direct rapidjson build: replace the serialize→parse→copy string
round-trip with build_vertex_json_value / build_edge_json_value
helpers that construct rapidjson objects directly into the allocator.
2. Lightweight path encoding: path output now encodes only _ID, _LABEL,
and PK for nodes (plus _SRC_ID, _DST_ID for edges), skipping all
non-PK property lookups. Users can retrieve full node/edge details
with a separate MATCH query when needed.
Measured on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths, 3-round avg):
BFS yield+return path: 1.63s → 0.50s (3.3x faster, ↓ 70%)
Breakdown of improvement:
- String round-trip elimination: 1.63s → 1.31s (↓ 20%)
- Lightweight encoding (skip property I/O): 1.31s → 0.50s (↓ 62%)
All 94 GDS tests pass. No regressions in other test suites.
…path encoding
Add configurable path encoding mode to control performance vs completeness:
- **Lightweight mode (default)**: Only encodes structural info (_ID, _LABEL, PK for nodes; _ID, _LABEL, _SRC_ID, _DST_ID for edges)
- Performance: ~0.54s on LDBC SNB SF10 (65K nodes, 1.9M edges, 62K paths)
- **Full mode**: Encodes all vertex and edge properties
- Performance: ~1.26s (2.34x slower than lightweight)
Usage:
CALL bfs('graph', {source: '0'}) -- lightweight (default)
CALL bfs('graph', {source: '0', path_properties: 'full'}) -- full
Implementation:
- Add thread-local flag in sink.cc to control encoding mode
- Add set_path_full_encoding()/get_path_full_encoding() API in sink.h
- Add configure_path_encoding() helper in path_utils.h
- Parse path_properties option in bfs.cc and sssp.cc
- Update build_vertex_json_value() and build_edge_json_value() to check flag
- Add 5 new tests in test_gds_path.py::TestPathEncodingModes
- Update spec document with configuration details
All 57 GDS tests pass. No regressions in other test suites.
- Add 'Shortest Path Return' section to doc/source/extensions/load_gds.md - Document path_properties option (lightweight/full modes) - Update BFS and SSSP sections with path examples - Update Algorithm Summary table - Remove technical spec file (not for users)
9b83082 to
0262596
Compare
….1 + clang-format 10.0.1) - Remove build_vertex_json_value_light and build_edge_json_value_light (defined but not used, causes -Werror=unused-function in CI) - Re-run isort with pinned version 5.10.1 (CI uses 5.10.1, not 6.x) - Re-run black on Python test files
f394613 to
a59f3d6
Compare
…TCH p = ... queries The previous lightweight default broke C++ tests that use MATCH p = (a)-[*]->(b) queries, which expect full property encoding. This commit: - Changes default path encoding to 'full' mode (backward compatible) - Users can opt into 'lightweight' mode via path_properties option for better performance - Updates GDS BFS/SSSP algorithms to use 'full' as default - Updates tests to verify full mode is default - Updates documentation to reflect the change
- Reorder BFS examples: default full path first, explicit full second, lightweight last as performance optimization - Update performance recommendations to describe full mode as default and lightweight as opt-in for structure-only use cases - Remove internal design spec (extension/gds/docs/gds-shortest-path-spec.md) Closes alibaba#575
f766964 to
4a1729e
Compare
… doc links - Fix project_graph() syntax (was missing triplet map format) - Remove reference to non-existent docs/gds-shortest-path-spec.md - Point to existing doc/source/extensions/load_gds.md for full docs - List all 9 algorithms in a concise table - Fix build commands to use repo root Makefile
| if (__atomic_compare_exchange_n(&distances_[dst], &expected, | ||
| level, false, __ATOMIC_RELAXED, | ||
| __ATOMIC_RELAXED)) { | ||
| if (return_path_) { |
There was a problem hiding this comment.
I'm not sure whether there are multi-threading concurrency issues here. A more reasonable approach would be: after the BFS completes, re-traverse from each node—for example, starting from nodes at depth d to find their neighbors at depth d-1, continuing until reaching the target node.
There was a problem hiding this comment.
Thanks for pointing this out. You are right that the dense pull mode had a race condition on predecessors_.
Fix applied (commit 3dbc758): In the dense pull mode, I added an atomic CAS on distances_[dst] before writing predecessors_[dst]. The flow is now:
- Check if the vertex is unvisited (
distances_[dst] == max) - Scan neighbors for a predecessor at
level - 1 - If reachable, CAS
distances_[dst]frommaxtolevel— only the winning thread writespredecessors_[dst]
This ensures only one thread can claim each vertex and set its predecessor, eliminating the race. The sparse push mode was already safe (CAS on distance was already there), so no change was needed there.
There was a problem hiding this comment.
Good suggestion! We've adopted exactly this approach in the latest commit. Predecessor arrays are now completely removed from BFS computation. Paths are reconstructed post-hoc in sink() by walking the distance array: for vertex v at distance d, scan neighbors to find one at distance d-1, repeat until reaching source. This eliminates the concurrency hazard entirely — no CAS needed, no predecessor writes during parallel computation. The dense pull mode has been reverted to the original simple non-atomic distances_[dst] = level.
| double cand = dist + weight; | ||
| if (distances_[w] < 0 || cand < distances_[w]) { | ||
| distances_[w] = cand; | ||
| if (return_path_) { |
There was a problem hiding this comment.
Similar to BFS, it may be difficult to trace back predecessor nodes in SSSP. A simple solution is to implement a single-threaded version to maintain the predecessors.
There was a problem hiding this comment.
consider routing the case that returns a path through the predicate branch, since it is inherently single-threaded.
There was a problem hiding this comment.
Good point. Adopted your suggestion — when path return is requested (return_path=true), the dispatcher in sssp.cc now routes to SSSPPred (single-threaded Dijkstra) regardless of whether predicates are present. This completely avoids the concurrency hazard of maintaining predecessor chains in the parallel SSSP, where iterative relaxation by multiple threads makes predecessor tracking non-trivial.
There was a problem hiding this comment.
Exactly what I did — the dispatcher in sssp.cc now checks return_path and routes to the predicate branch (SSSPPred, single-threaded Dijkstra) when path return is requested. This is inherently safe since there is no concurrent access to the predecessor array. All 95 Python tests pass.
There was a problem hiding this comment.
Addressed. SSSPPred no longer stores predecessors during computation. Path reconstruction is done post-hoc in sink() by walking the distance array: for vertex v with distance D, find a neighbor u where dist(u) + weight(u→v) == dist(v). This is single-threaded and avoids the difficulty of maintaining predecessor chains during parallel relaxation.
There was a problem hiding this comment.
Yes, we route return_path cases through SSSPPred (single-threaded Dijkstra) as suggested. SSSPPred now also uses post-hoc path reconstruction instead of a predecessor array, so the routing is kept for the predicate case and the path return case. The parallel SSSP is only used when no predicates and no path return.
Address review comments from liulx20 on PR alibaba#574 regarding multi-threading concurrency issues in predecessor tracking. BFS (bfs_impl.cc): - Dense pull mode: use atomic CAS for distance update to ensure only one thread writes predecessors_[dst], eliminating the race where multiple threads could simultaneously claim the same unvisited vertex and overwrite each other predecessor entry. SSSP (sssp.cc): - Route path return through SSSPPred (single-threaded Dijkstra) when return_path=true, regardless of whether predicates are present. This avoids the inherent difficulty of maintaining correct predecessor chains in the parallel SSSP iterative relaxation. Also fix tarfile DeprecationWarning in io_utils.py (filter=data).
This reverts commit 91039e0.
Add test_gds_concurrency.py with 19 tests validating the concurrency fix (3dbc758) on a 20K-vertex graph that triggers multi-threaded parallel_for and dense pull mode. Tests verify BFS distance determinism, path validity (correct length, valid edges, no cycles, correct endpoints), and SSSP full determinism (single-threaded SSSPPred routing). Move standalone benchmark scripts from tools/python_bind/tests/ to benchmark/: - bench_gds_path.py (was tests/bench_gds_path.py) - bench_gds_algorithms.py (was tests/test_gds_benchmark.py) - bench_gds_performance.py (was tests/test_gds_performance.py) - bench_wcc_comparison.py (was tests/test_wcc_comparison.py) Add test_gds_path.py and test_gds_concurrency.py to the extension test CI workflow.
Remove bench_gds_algorithms.py, bench_gds_performance.py, and bench_wcc_comparison.py from benchmark/ — only bench_gds_path.py should be moved. Apply isort+black formatting to test_gds_concurrency.py.
Keep concurrency as the last option consistently across all algorithm tables.
|
|
||
| from neug.database import Database | ||
|
|
||
| DATA_DIR = os.path.expanduser("~/Downloads/social_network_10_interactive") |
There was a problem hiding this comment.
Should we avoid using absolute paths here?
There was a problem hiding this comment.
Fixed. Replaced the hardcoded ~/Downloads/... path with an environment variable NEUG_LDBC_DATA_DIR. The test auto-skips when the variable is not set.
| // - lightweight: only encode _ID, _LABEL, PK for vertices; structural info for | ||
| // edges | ||
| // GDS extension can call set_path_full_encoding(false) for lightweight mode | ||
| static thread_local bool path_full_encoding_enabled = true; |
There was a problem hiding this comment.
What is the scope of this, and can it be declared outside of sink.cc?
There was a problem hiding this comment.
this variable now affects the sink behavior of other queries, which is unreasonable.
There was a problem hiding this comment.
The global variable has been completely removed. The encoding mode is now carried per-column: PathColumnBuilder has a set_full_encoding(bool) setter, PathColumn carries the flag, and sink.cc's add_column() reads it from the column when serializing. GDS sink() methods receive full_encoding as a parameter and set it on the builder. Non-GDS path columns default to true (full encoding), so other queries are never affected.
There was a problem hiding this comment.
Fully resolved. The static thread_local bool path_full_encoding_enabled variable has been completely removed from sink.cc. The encoding mode is now carried as a per-column member on PathColumn/PathColumnBuilder, so GDS path queries with path_properties: 'lightweight' only affect their own path column. Other queries' path columns always default to full encoding and are never affected.
| {"node", common::DataTypeId::kVertex}, | ||
| {"distance", common::DataTypeId::kDouble}}; | ||
| {"distance", common::DataTypeId::kDouble}, | ||
| {"path", common::DataTypeId::kPath}}; |
There was a problem hiding this comment.
The function declaration here returns the column for the path, yet it doesn't return anything when return path is false, which seems odd. Is there a better way to handle this?
There was a problem hiding this comment.
The 3-column output declaration (node, distance, path) represents the function's schema — the maximum set of columns the function CAN produce. The YIELD clause selects which columns are actually returned. When the user doesn't YIELD path, return_path is false and the path column is simply not populated in the output Context. This is standard Cypher CALL behavior — the output columns declare what's available, not what's always produced. We kept this design as changing it would require modifications to the function binding infrastructure.
| void compute(); | ||
| void sink(execution::Context& ctx, int node_alias, int distance_alias); | ||
| void sink(execution::Context& ctx, int node_alias, int distance_alias, | ||
| int path_alias = -1); |
There was a problem hiding this comment.
Actually, -1 is a valid alias, so we probably shouldn't use -1 to indicate whether the path should be output.
There was a problem hiding this comment.
Fixed. The path_alias >= 0 check has been removed. All sink() methods now use return_path_ as the sole indicator for whether to build the path column. When return_path_ is true, path_alias is always valid (set by the binder).
| SSSP sssp(graph, sssp_input.vertex_label, sssp_input.edge_label, source_vid, | ||
| sssp_input.directed, sssp_input.edge_weight, | ||
| sssp_input.concurrency); | ||
| sssp_input.concurrency, sssp_input.return_path); |
There was a problem hiding this comment.
Since the else branch does not handle return_path, the logic related to return_path can be removed.
There was a problem hiding this comment.
Done. The else branch no longer passes return_path or path_alias to the parallel SSSP. The return_path_ and predecessors_ members have been completely removed from the SSSP class, along with the dead if (return_path_) branches in compute() and sink(). The parallel SSSP constructor no longer takes a return_path parameter.
Address review comment from liulx20: use NEUG_LDBC_DATA_DIR environment variable instead of hardcoded ~/Downloads path. Test skips when the variable is not set.
…rays Address liulx20's review comments on PR alibaba#574: - Remove predecessor arrays from BFS, BFSPred, SSSP, SSSPPred. Paths are now reconstructed post-hoc in sink() by walking the distance array backward: for vertex v at distance d, find a neighbor u at distance d-1. This eliminates the concurrency hazard entirely (path reconstruction is single-threaded). - BFS dense pull mode: revert CAS to simple non-atomic write (CAS was only needed to protect predecessor writes, which are now gone). - SSSP parallel: remove return_path_ and predecessors_ entirely (parallel SSSP never produces paths; path return routes through SSSPPred). - Remove configure_path_encoding() wrapper; call set_path_full_encoding() directly in bfs.cc/sssp.cc and reset to true (default) after sink() to prevent the thread-local flag from leaking to other queries. - Use return_path_ instead of path_alias >= 0 as the path output indicator (-1 is a valid alias per liulx20's comment). - Remove PlainPredecessorAccessor; reconstruct_path() now accepts a callable for finding predecessors on-the-fly. Performance verified on Graph500-23 (4.6M vertices, 129M edges): BFS with path: 38.2s (original) → 37.1s (refactor) SSSP with path: 47.4s (original) → 46.0s (refactor) No regression; slight improvement from removed CAS overhead.
Address liulx20's comment: "this variable now affects the sink behavior of other queries, which is unreasonable." Remove the thread_local global variable path_full_encoding_enabled from sink.cc. Instead, the encoding mode is carried as a member of PathColumn/PathColumnBuilder: - PathColumnBuilder::set_full_encoding(bool) sets the mode - PathColumnBuilder::finish() propagates it to PathColumn - PathColumn::full_encoding() reads it - sink.cc's add_column() reads it from the column and passes to convert_path_to_json() GDS sink() methods receive full_encoding as a parameter and set it on the PathColumnBuilder. Non-GDS path columns default to true (full encoding), so other queries are never affected.
Summary
Add shortest path return to GDS BFS and SSSP algorithms, plus path serialization optimizations that benefit all path-returning queries.
Changes
Commit 1: feat(gds): add shortest path return to BFS and SSSP algorithms
pathoutput column via YIELD clause (pure YIELD-based, no config options)extension/gds/— no system code modifiedCommit 2-3: perf: optimize path JSON serialization
convert_path_to_json(sink.cc)build_vertex_json_value/build_edge_json_valueconvert_vertex_to_json/convert_edge_to_jsonunchanged for standalone callersCommit 4: feat(gds): add path_properties configuration
path_properties: 'full'): All vertex and edge propertiesUsage
Testing
Performance (LDBC SNB SF10: 65K nodes, 1.9M edges, 62K paths)
Files Changed
extension/gds/— 12 files (path return implementation + configuration)include/neug/execution/common/operators/retrieve/sink.h— encoding mode APIsrc/execution/common/operators/retrieve/sink.cc— serialization optimizationtools/python_bind/tests/— 3 new test files, 21 new tests