SIGSEGV: getGroup(UINT32_MAX) — unguarded out-of-range CSR chunk-group index on rel delete + checkpoint #611

@rysweet

Summary

A long-running process that performs relationship deletes followed by checkpoints crashes with SIGSEGV during the consolidation/checkpoint phase. The faulting access is:

GroupCollection<ChunkedNodeGroup>::getGroup(groupIdx = 4294967295 / UINT32_MAX)

i.e. getGroup is called with an out-of-range group index and the result is dereferenced.

This reproduces on 0.15.3, 0.15.4, and 0.17.1, and the same unguarded code is present on current main (7c5210f), so it appears version-independent.

Symptom / backtraces

Two distinct manifestations, same faulting call (getGroup(UINT32_MAX)):

1) During a relationship READ scan (ScanMultiRelTable):

getGroup(UINT32_MAX) null/OOB deref
  <- CSRNodeGroup::scanCommittedInMemRandom <- CSRNodeGroup::scan
  <- RelTableScanState::scanNext <- RelTableCollectionScanner::scan (scan_multi_rel_tables.cpp)
  <- ScanMultiRelTable::getNextTuplesInternal <- ... <- TaskScheduler worker thread

2) During CHECKPOINT (collectInMemRegionChangesAndUpdateHeaderLength):

ChunkedNodeGroup::isDeleted(this = 0x0)
  <- CSRNodeGroup::collectInMemRegionChangesAndUpdateHeaderLength (checkpoint)

Root cause

After relationship delete + checkpoint churn, an in-memory CSR index entry (csrIndex->indices[nodeOffset]) can retain a stale row (an INVALID_ROW_IDX sentinel). getQuotientRemainder(INVALID_ROW_IDX, CHUNKED_NODE_GROUP_CAPACITY) then yields an out-of-range chunk-group index.
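The arithmetic can be sketched as follows. This is a minimal illustration, not the library's code: it assumes INVALID_ROW_IDX is the maximum 64-bit unsigned value and CHUNKED_NODE_GROUP_CAPACITY is 2048, and `quotientRemainder` and `toGroupIdx` are hypothetical stand-ins. Under those assumptions the quotient narrowed to 32 bits is exactly UINT32_MAX, which would match the groupIdx in the backtrace.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <utility>

// Assumed values for illustration only; check the real definitions in the source tree.
constexpr uint64_t kInvalidRowIdx = std::numeric_limits<uint64_t>::max(); // stand-in for INVALID_ROW_IDX
constexpr uint64_t kChunkCapacity = 2048; // stand-in for CHUNKED_NODE_GROUP_CAPACITY

// Hypothetical stand-in for getQuotientRemainder:
// splits a row index into (chunk-group index, row offset within that chunk).
constexpr std::pair<uint64_t, uint64_t> quotientRemainder(uint64_t row, uint64_t capacity) {
    return {row / capacity, row % capacity};
}

// Hypothetical narrowing step: keeps only the low 32 bits of the quotient,
// as would happen if the group index is stored in a 32-bit type.
constexpr uint32_t toGroupIdx(uint64_t quotient) {
    return static_cast<uint32_t>(quotient);
}
```

With these assumed constants, `quotientRemainder(kInvalidRowIdx, kChunkCapacity)` gives a quotient of 2^53 - 1 and a remainder of 2047, and `toGroupIdx` of that quotient is 4294967295.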

GroupCollection::getGroup / getGroupNoLock (src/include/storage/table/group_collection.h) only guard the index with a debug-only DASSERT(groupIdx < groups.size()), which is a no-op in release builds, so they perform an out-of-bounds std::vector access and the callers dereference the result.

The insertion-write loop in CSRNodeGroup::checkpointColumnInRegion already guards this with if (row == INVALID_ROW_IDX) continue;, but several other CSRNodeGroup paths do not and call getGroup(chunkIdx) directly on the stale index:

  • scanCommittedInMemSequential / scanCommittedInMemRandom
  • update / delete_ (COMMITTED_IN_MEMORY case)
  • the two checkpoint-collect loops: collectInMemRegionChangesAndUpdateHeaderLength and populateCSRLengthInMemOnly (both call getGroup(...)->isDeleted(...) with no guard)

(Line references on main@7c5210f: group_collection.h:49/52/55/56; csr_node_group.cpp getGroup call sites ~293, ~320, ~412, ~434, ~856 (guarded), ~917, ~1121.)

Reproduction

A deterministic loop of about 120 iterations: add relationships across one or more rel tables among a small set of nodes, delete some of them, checkpoint, and repeat; then run a relationship scan or a further checkpoint. The committed in-memory CSR index ends up with a stale row that maps to an out-of-range chunk group, and the next scan or checkpoint dereferences it. (The daemon that surfaced this recovers via WAL after each crash, so there is no data loss; it just crashes roughly hourly.)

I'm happy to share a minimal standalone Cypher/embedded reproduction if that would help.

Note on a fix

I have a candidate fix that bounds-checks chunkIdx against getNumGroups() before calling getGroup at each affected CSRNodeGroup site (skipping / treating the stale row as deleted, consistent with the existing INVALID_ROW_IDX handling), leaving getGroup's contract unchanged. It builds clean under the minimal test (linux) job (-Werror, GCC 13) and passes the full e2e test suite in my private validation run.
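To make the shape of that guard concrete, here is a rough sketch under stated assumptions. It is not the actual patch: `Chunk` and `Groups` are simplified stand-ins for ChunkedNodeGroup and GroupCollection, `isRowDeletedOrStale` is a hypothetical helper, and the constants are assumed values. Only the idea (check the sentinel, then check chunkIdx against getNumGroups() before calling getGroup) comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <memory>
#include <utility>
#include <vector>

// Assumed values for illustration only.
constexpr uint64_t kInvalidRowIdx = std::numeric_limits<uint64_t>::max(); // stand-in for INVALID_ROW_IDX
constexpr uint64_t kChunkCapacity = 2048; // stand-in for CHUNKED_NODE_GROUP_CAPACITY

// Simplified stand-in for ChunkedNodeGroup: tracks a deleted flag per row.
struct Chunk {
    std::vector<bool> deleted = std::vector<bool>(kChunkCapacity, false);
    bool isDeleted(uint64_t rowInChunk) const { return deleted[rowInChunk]; }
};

// Simplified stand-in for GroupCollection<ChunkedNodeGroup>.
class Groups {
public:
    void append(std::unique_ptr<Chunk> chunk) { groups.push_back(std::move(chunk)); }
    std::size_t getNumGroups() const { return groups.size(); }
    // Contract unchanged: the caller must pass a valid index.
    // The assert is compiled out when NDEBUG is defined (release builds).
    Chunk* getGroup(std::size_t idx) const {
        assert(idx < groups.size());
        return groups[idx].get();
    }

private:
    std::vector<std::unique_ptr<Chunk>> groups;
};

// Hypothetical call-site guard: treat a sentinel or out-of-range row as deleted
// instead of dereferencing whatever getGroup would return.
bool isRowDeletedOrStale(const Groups& groups, uint64_t row) {
    if (row == kInvalidRowIdx) {
        return true; // same treatment as the existing INVALID_ROW_IDX check
    }
    const uint64_t chunkIdx = row / kChunkCapacity;
    if (chunkIdx >= groups.getNumGroups()) {
        return true; // stale index pointing past the last chunk group
    }
    return groups.getGroup(chunkIdx)->isDeleted(row % kChunkCapacity);
}
```

Keeping the check at the call sites, rather than inside getGroup, lets each path decide what a stale row means (skip it, or count it as deleted) without changing behaviour for callers that already pass valid indices.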

I'm glad to open a PR if that's useful — but I completely understand a bug report may be more valuable than a patch here, and you may prefer to address the stale-CSR-index root cause your own way. Happy to help however is most useful.
