Summary
A long-running process that performs relationship deletes followed by checkpoints crashes with SIGSEGV during the consolidation/checkpoint phase. The faulting access is:
GroupCollection<ChunkedNodeGroup>::getGroup(groupIdx = 4294967295 / UINT32_MAX)
i.e. getGroup is called with an out-of-range group index and the result is dereferenced.
This reproduces on 0.15.3, 0.15.4, and 0.17.1, and the same unguarded code is present on current main (7c5210f), so it appears version-independent.
Symptom / backtraces
Two distinct manifestations, same faulting call (getGroup(UINT32_MAX)):
1) During a relationship READ scan (ScanMultiRelTable):
getGroup(UINT32_MAX) null/OOB deref
<- CSRNodeGroup::scanCommittedInMemRandom <- CSRNodeGroup::scan
<- RelTableScanState::scanNext <- RelTableCollectionScanner::scan (scan_multi_rel_tables.cpp)
<- ScanMultiRelTable::getNextTuplesInternal <- ... <- TaskScheduler worker thread
2) During CHECKPOINT (collectInMemRegionChangesAndUpdateHeaderLength):
ChunkedNodeGroup::isDeleted(this = 0x0)
<- CSRNodeGroup::collectInMemRegionChangesAndUpdateHeaderLength (checkpoint)
Root cause
After relationship delete + checkpoint churn, an in-memory CSR index entry (csrIndex->indices[nodeOffset]) can retain a stale row (an INVALID_ROW_IDX sentinel). getQuotientRemainder(INVALID_ROW_IDX, CHUNKED_NODE_GROUP_CAPACITY) then yields an out-of-range chunk-group index.
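To make the arithmetic concrete, here is a minimal sketch. It assumes the sentinel is the maximum 64-bit value and the chunk capacity is 2048; both are assumptions for illustration, not values confirmed in this report. quotientRemainder is a hypothetical stand-in for getQuotientRemainder.

```cpp
#include <cstdint>
#include <limits>
#include <utility>

// Assumed stand-ins; the real definitions live in the project headers.
using row_idx_t = std::uint64_t;
constexpr row_idx_t INVALID_ROW_IDX = std::numeric_limits<row_idx_t>::max();
constexpr std::uint64_t CHUNKED_NODE_GROUP_CAPACITY = 2048;

// Hypothetical equivalent of getQuotientRemainder:
// returns {chunk-group index, row offset inside that chunk group}.
constexpr std::pair<std::uint64_t, std::uint64_t> quotientRemainder(
    std::uint64_t value, std::uint64_t divisor) {
    return {value / divisor, value % divisor};
}
```

Under those assumptions the quotient is 2^53 - 1, far beyond any real group count, and its low 32 bits are all ones, which would match the 4294967295 seen in the backtrace if the index passes through a 32-bit type somewhere along the way.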
GroupCollection::getGroup / getGroupNoLock (src/include/storage/table/group_collection.h) only guard the index with a debug-only DASSERT(groupIdx < groups.size()), which is a no-op in release builds, so they perform an out-of-bounds std::vector access and the callers dereference the result.
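As an illustration of why a debug-only assertion gives no protection in release builds, the sketch below uses a hypothetical TOY_DASSERT macro (I have not copied the project's actual macro definition) that expands to nothing unless a debug flag is set, next to an explicit check that exists in every build type.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical debug-only assertion: active only when TOY_RUNTIME_CHECKS
// is defined, otherwise it expands to nothing and the condition is never
// even evaluated.
#ifdef TOY_RUNTIME_CHECKS
#define TOY_DASSERT(cond) do { if (!(cond)) std::abort(); } while (false)
#else
#define TOY_DASSERT(cond) ((void)0)
#endif

// An explicit range check that survives in every build type.
inline bool isValidGroupIdx(std::size_t groupIdx, std::size_t numGroups) {
    return groupIdx < numGroups;
}
```

With the flag undefined, TOY_DASSERT(groupIdx < groups.size()) compiles to nothing, so an indexing expression that follows it runs unchecked; only a test like isValidGroupIdx actually prevents the out-of-range access.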
The insertion-write loop in CSRNodeGroup::checkpointColumnInRegion already skips such rows (if (row == INVALID_ROW_IDX) continue;), but several other CSRNodeGroup paths do not and call getGroup(chunkIdx) directly on the stale index:
- scanCommittedInMemSequential / scanCommittedInMemRandom
- update / delete_ (COMMITTED_IN_MEMORY case)
- the two checkpoint-collect loops, collectInMemRegionChangesAndUpdateHeaderLength and populateCSRLengthInMemOnly (both call getGroup(...)->isDeleted(...) with no guard)
(Line references on main@7c5210f: group_collection.h:49/52/55/56; csr_node_group.cpp getGroup call sites ~293, ~320, ~412, ~434, ~856 (guarded), ~917, ~1121.)
Reproduction
A deterministic loop of roughly 120 iterations: add relationships across one or more rel tables among a small set of nodes, delete some of them, checkpoint, and repeat; then run a relationship scan or the next checkpoint. The committed in-memory CSR index ends up with a stale row mapping to an out-of-range chunk group, and the next scan or checkpoint dereferences it. (The daemon that surfaced this recovers via WAL after each crash, so there's no data loss; it just crashes about once an hour.)
I'm happy to share a minimal standalone Cypher/embedded reproduction if that would help.
Note on a fix
I have a candidate fix that bounds-checks chunkIdx against getNumGroups() before calling getGroup at each affected CSRNodeGroup site (skipping / treating the stale row as deleted, consistent with the existing INVALID_ROW_IDX handling), leaving getGroup's contract unchanged. It builds clean under the minimal test (linux) job (-Werror, GCC 13) and passes the full e2e test suite in my private validation run.
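The shape of such a call-site guard can be sketched as follows. This is not the actual patch: ToyChunk, countLiveRows, and the tiny CHUNK_CAPACITY are hypothetical names and values chosen for illustration, and the sentinel is assumed to be the maximum 64-bit value.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

using row_idx_t = std::uint64_t;
// Assumed sentinel value for a row that no longer exists.
constexpr row_idx_t INVALID_ROW_IDX = std::numeric_limits<row_idx_t>::max();
constexpr std::uint64_t CHUNK_CAPACITY = 4; // tiny on purpose, for illustration

// Hypothetical stand-in for a chunked node group with per-row delete flags.
struct ToyChunk {
    std::vector<bool> deleted; // deleted[rowInChunk]
    bool isDeleted(std::uint64_t rowInChunk) const { return deleted[rowInChunk]; }
};

// Counts live rows, treating sentinel or out-of-range rows as deleted
// instead of indexing past the end of the group list.
inline std::uint64_t countLiveRows(const std::vector<row_idx_t>& rows,
                                   const std::vector<ToyChunk>& groups) {
    std::uint64_t live = 0;
    for (row_idx_t row : rows) {
        if (row == INVALID_ROW_IDX) {
            continue; // existing style of guard
        }
        const std::uint64_t chunkIdx = row / CHUNK_CAPACITY;
        const std::uint64_t rowInChunk = row % CHUNK_CAPACITY;
        if (chunkIdx >= groups.size()) {
            continue; // added bounds check: stale row, skip it
        }
        if (!groups[chunkIdx].isDeleted(rowInChunk)) {
            ++live;
        }
    }
    return live;
}
```

The lookup itself stays unchanged; the caller decides what a stale row means, which in this sketch is "skip it, as if deleted".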
I'm glad to open a PR if that's useful — but I completely understand a bug report may be more valuable than a patch here, and you may prefer to address the stale-CSR-index root cause your own way. Happy to help however is most useful.