SIGSEGV: getGroup(UINT32_MAX) — unguarded out-of-range CSR chunk-group index on rel delete + checkpoint

## Summary

A long-running process that performs **relationship deletes followed by checkpoints** crashes with `SIGSEGV`, either during the consolidation/checkpoint phase or during a subsequent relationship scan. The faulting access is:

```
GroupCollection<ChunkedNodeGroup>::getGroup(groupIdx = 4294967295)   // UINT32_MAX
```

That is, `getGroup` is called with an out-of-range group index and the result is dereferenced.

This reproduces on **0.15.3, 0.15.4, and 0.17.1**, and the same unguarded code is present on current `main` (`7c5210f`), so it appears version-independent.

## Symptom / backtraces

Two distinct manifestations, same faulting call (`getGroup(UINT32_MAX)`):

**1) During a relationship READ scan (`ScanMultiRelTable`):**
```
getGroup(UINT32_MAX) null/OOB deref
  <- CSRNodeGroup::scanCommittedInMemRandom <- CSRNodeGroup::scan
  <- RelTableScanState::scanNext <- RelTableCollectionScanner::scan (scan_multi_rel_tables.cpp)
  <- ScanMultiRelTable::getNextTuplesInternal <- ... <- TaskScheduler worker thread
```

**2) During CHECKPOINT (`collectInMemRegionChangesAndUpdateHeaderLength`):**
```
ChunkedNodeGroup::isDeleted(this = 0x0)
  <- CSRNodeGroup::collectInMemRegionChangesAndUpdateHeaderLength (checkpoint)
```

## Root cause

After relationship delete + checkpoint churn, an in-memory CSR index entry (`csrIndex->indices[nodeOffset]`) can retain a **stale row** (an `INVALID_ROW_IDX` sentinel). `getQuotientRemainder(INVALID_ROW_IDX, CHUNKED_NODE_GROUP_CAPACITY)` then yields an **out-of-range chunk-group index**.
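
The arithmetic can be illustrated with a small standalone sketch. The constants and types here are assumptions made for illustration, not values quoted from the sources: a 64-bit all-ones sentinel, a chunk capacity of 2048, and a 32-bit group index. `splitRow` and `narrowToGroupIdx` are hypothetical helpers, not the project's `getQuotientRemainder`. Under those assumptions the result matches the `groupIdx = 4294967295` seen in the backtraces.

```cpp
#include <cstdint>
#include <limits>

// ASSUMPTION: the sentinel is the all-ones 64-bit value.
constexpr uint64_t kAssumedInvalidRowIdx = std::numeric_limits<uint64_t>::max();
// ASSUMPTION: capacity of one chunked node group; check the real constant.
constexpr uint64_t kAssumedChunkCapacity = 2048;

struct QuotientRemainder {
    uint64_t quotient;  // chunk-group index
    uint64_t remainder; // row within that chunk group
};

// Hypothetical stand-in for a quotient/remainder split of a row index.
inline QuotientRemainder splitRow(uint64_t row, uint64_t capacity) {
    return QuotientRemainder{row / capacity, row % capacity};
}

// ASSUMPTION: the group index is 32 bits wide, so narrowing keeps only
// the low 32 bits of the quotient.
inline uint32_t narrowToGroupIdx(uint64_t quotient) {
    return static_cast<uint32_t>(quotient);
}
```

With these assumed values, the sentinel splits into quotient `2^53 - 1` and remainder `2047`, and narrowing the quotient to 32 bits gives `UINT32_MAX`.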

`GroupCollection::getGroup` / `getGroupNoLock` (`src/include/storage/table/group_collection.h`) only guard the index with a debug-only `DASSERT(groupIdx < groups.size())`, which is a no-op in release builds, so they perform an **out-of-bounds `std::vector` access** and the callers dereference the result.
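
To show why a debug-only assertion gives no protection in release builds, here is a minimal sketch. `SKETCH_DASSERT`, `ToyCollection`, and both accessors are hypothetical stand-ins and are not the code in `group_collection.h`; the sketch assumes the assertion macro is compiled out when `NDEBUG` is defined.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical debug-only assertion: expands to nothing when NDEBUG is set.
#ifdef NDEBUG
#define SKETCH_DASSERT(cond) ((void)0)
#else
#define SKETCH_DASSERT(cond) assert(cond)
#endif

struct ToyCollection {
    std::vector<int> groups;

    // Debug-only guard: in a release build the index reaches operator[]
    // unchecked, which is undefined behaviour when idx >= groups.size().
    int* getGroupDebugChecked(uint32_t idx) {
        SKETCH_DASSERT(idx < groups.size());
        return &groups[idx];
    }

    // Always-on guard: returns nullptr for an out-of-range index instead.
    int* getGroupChecked(uint32_t idx) {
        return idx < groups.size() ? &groups[idx] : nullptr;
    }
};
```

Calling `getGroupChecked(UINT32_MAX)` on a two-element collection returns `nullptr`, whereas the debug-checked variant would read past the end of the vector in a release build.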

The insertion-write loop in `CSRNodeGroup::checkpointColumnInRegion` already guards this with `if (row == INVALID_ROW_IDX) continue;`, but several other `CSRNodeGroup` paths do **not** and call `getGroup(chunkIdx)` directly on the stale index:

- `scanCommittedInMemSequential` / `scanCommittedInMemRandom`
- `update` / `delete_` (`COMMITTED_IN_MEMORY` case)
- the two checkpoint-collect loops: `collectInMemRegionChangesAndUpdateHeaderLength` and `populateCSRLengthInMemOnly` (both call `getGroup(...)->isDeleted(...)` with no guard)

Line references on `main@7c5210f`: `group_collection.h:49/52/55/56`; `csr_node_group.cpp` `getGroup` call sites ~293, ~320, ~412, ~434, ~856 (guarded), ~917, ~1121.

## Reproduction

A deterministic loop of roughly 120 iterations: add relationships across one or more rel tables among a small set of nodes, delete some of them, run `checkpoint`, and repeat. Then run a relationship scan or another checkpoint. The committed in-memory CSR index ends up with a stale row that maps to an out-of-range chunk group, and the next scan or checkpoint dereferences it. (The daemon that surfaced this recovers via WAL after each crash, so there is no data loss; it just crashes roughly hourly.)

I'm happy to share a minimal standalone Cypher/embedded reproduction if that would help.

## Note on a fix

I have a candidate fix that bounds-checks `chunkIdx` against `getNumGroups()` before calling `getGroup` at each affected `CSRNodeGroup` site (skipping / treating the stale row as deleted, consistent with the existing `INVALID_ROW_IDX` handling), leaving `getGroup`'s contract unchanged. It builds clean under the `minimal test (linux)` job (`-Werror`, GCC 13) and passes the full e2e test suite in my private validation run.
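
The shape of that call-site guard is roughly as follows. This is a simplified sketch, not the actual patch: `ToyGroup`, `ToyGroupCollection`, and `countLiveRows` are hypothetical, and the per-group `deleted` flag stands in for the real per-row `isDeleted(...)` check.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct ToyGroup {
    bool deleted = false; // simplified stand-in for per-row deletion state
};

class ToyGroupCollection {
public:
    void append(std::unique_ptr<ToyGroup> group) { groups.push_back(std::move(group)); }
    uint64_t getNumGroups() const { return groups.size(); }
    // Contract unchanged: the caller must pass idx < getNumGroups().
    ToyGroup* getGroup(uint32_t idx) const { return groups[idx].get(); }

private:
    std::vector<std::unique_ptr<ToyGroup>> groups;
};

// Counts rows that are still live, treating any row whose chunk index is
// out of range (for example a stale sentinel row) as deleted.
inline uint64_t countLiveRows(const ToyGroupCollection& groups,
                              const std::vector<uint64_t>& rows, uint64_t capacity) {
    uint64_t live = 0;
    for (uint64_t row : rows) {
        const uint64_t chunkIdx = row / capacity;
        // Compare in 64 bits before narrowing, so a huge quotient cannot
        // wrap around to a small index.
        if (chunkIdx >= groups.getNumGroups()) {
            continue;
        }
        if (!groups.getGroup(static_cast<uint32_t>(chunkIdx))->deleted) {
            ++live;
        }
    }
    return live;
}
```

The guard sits in the caller rather than in `getGroup`, so existing callers that already pass valid indices are unaffected.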

I'm glad to open a PR if that's useful — but I completely understand a bug report may be more valuable than a patch here, and you may prefer to address the stale-CSR-index root cause your own way. Happy to help however is most useful.
