Skip to content

fix(storage): guard CSR getGroup against stale/OOB group indices (SIGSEGV)#610

Closed
rysweet wants to merge 1 commit into
LadybugDB:mainfrom
rysweet:fix/csr-getgroup-null-deref-segv
Closed

fix(storage): guard CSR getGroup against stale/OOB group indices (SIGSEGV)#610
rysweet wants to merge 1 commit into
LadybugDB:mainfrom
rysweet:fix/csr-getgroup-null-deref-segv

Conversation

@rysweet

@rysweet rysweet commented Jun 22, 2026

Copy link
Copy Markdown

Problem

A long-running process doing relationship deletes + checkpoints crashes with
SIGSEGV during the consolidation/checkpoint phase. The faulting access is
GroupCollection<ChunkedNodeGroup>::getGroup(groupIdx = 4294967295 / UINT32_MAX).

Root cause

After relationship delete + checkpoint churn, csrIndex->indices[nodeOffset] can
retain a stale row (an INVALID_ROW_IDX sentinel). getQuotientRemainder(INVALID_ROW_IDX)
yields an out-of-range chunk-group index. GroupCollection::getGroup only had a
debug-only DASSERT(groupIdx < groups.size()) (a no-op in release), so it performed an
out-of-bounds std::vector access and callers dereferenced the result → undefined
behavior / SIGSEGV.

The insertion-write loop in checkpointColumnInRegion already guards this case with
if (row == INVALID_ROW_IDX) continue;, but the two checkpoint-collect loops
(collectInMemRegionChangesAndUpdateHeaderLength and populateCSRLengthInMemOnly),
plus the scan / update / delete_ paths, do not — they dereference the getGroup
result directly.

Fix

  • GroupCollection::getGroup / getGroupNoLock now return nullptr for an out-of-range
    index instead of performing an OOB access.
  • Every CSRNodeGroup deref site handles nullptr:
    • scan (scanCommittedInMemSequential / scanCommittedInMemRandom): skip / empty result;
    • update: no-op (target row gone); delete_: returns false;
    • checkpoint insertion-write: skip;
    • the two checkpoint-collect isDeleted loops: treat a stale/null group as deleted
      (purges the stale row via turnToNonSequential() + setInvalid(i)), consistent with
      the engine's existing handling of deleted rows.

This is version-independent (reproduced on 0.15.x–0.17.1) and present on current main.

Validation

Reproduced via a deterministic 120-iteration delete+checkpoint churn test (amplihack-memory
issue #100). Before: SIGSEGV mid-churn in collectInMemRegionChangesAndUpdateHeaderLength.
After (engine rebuilt from source with this patch): the workload completes cleanly and the
test passes.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

…SEGV)

Relationship delete + checkpoint churn can leave a stale row in
csrIndex->indices[nodeOffset] that maps (via getQuotientRemainder of an
INVALID_ROW_IDX sentinel) to an out-of-range chunk-group index. getGroup
then performed an out-of-bounds std::vector access and callers
dereferenced the result, crashing with SIGSEGV during the consolidation/
checkpoint phase (getGroup(groupIdx=UINT32_MAX)).

GroupCollection::getGroup/getGroupNoLock now return nullptr for an
out-of-range index instead of an OOB access, and every CSRNodeGroup
deref site (scan, update, delete_, checkpoint insertion-write, and the
two checkpoint-collect isDeleted loops) handles nullptr by skipping /
treating the stale row as deleted. This matches the engine's existing
INVALID_ROW_IDX skip in the insertion loop.

Reproduced and fixed via a 120-iteration delete+checkpoint churn test
in amplihack-memory (issue LadybugDB#100): the workload previously SIGSEGV'd
mid-churn in collectInMemRegionChangesAndUpdateHeaderLength and now
completes cleanly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@rysweet

rysweet commented Jun 22, 2026

Copy link
Copy Markdown
Author

Withdrawing this PR — it was opened prematurely. Closing to validate the fix and CI in a private fork first; will re-open a clean, CI-green PR if appropriate. Apologies for the noise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant