
Full-repo indexing crashes in parallel.extract on large AOSP-scale repository #141

@sunny5541

codebase-memory-mcp crashes during index_repository on a very large AOSP-scale repository.

The failure happens in parallel.extract with:

```
Segmentation fault (core dumped)
```

This still reproduces after:

- excluding heavy generated/build directories with .cbmignore
- switching to mode: "fast"
- reducing the discovered file set significantly

## Version / source info

This is a locally built dev binary from source, not a tagged release binary.

- binary version: codebase-memory-mcp dev
- source checkout commit: e6e9c58e80a808d9a8450b11dc6c8dde085c8417
- source checkout branch: main


## Environment

- OS: Ubuntu 22.04.5 LTS
- kernel: Linux ... 6.8.0-106-generic ... x86_64
- glibc: 2.35
- gcc: 11.4.0
- g++: 11.4.0
- make: 4.3
- clang: not installed in this environment
- cmake: not installed in this environment

## Reproduction

```
codebase-memory-mcp cli index_repository '{"repo_path":"/absolute/path/to/large/aosp-style/repo"}'
```

Also reproduces with:

```
codebase-memory-mcp cli index_repository '{"repo_path":"/absolute/path/to/large/aosp-style/repo","mode":"fast"}'
```

Example .cbmignore exclusions used during investigation:

```
out/
out/**
prebuilts/
prebuilts/**
kernel/prebuilts/
kernel/prebuilts/**
external/XNNPACK/test/
external/XNNPACK/test/**
external/libabigail/tests/
external/libabigail/tests/**
packages/modules/NeuralNetworks/runtime/test/
packages/modules/NeuralNetworks/runtime/test/**
packages/modules/NeuralNetworks/tools/systrace_parser/parser/test/
packages/modules/NeuralNetworks/tools/systrace_parser/parser/test/**
packages/apps/TV/tuner/tests/
packages/apps/TV/tuner/tests/**
```

## Observed scale

- before exclusions: about 847196 files
- after excluding out/: about 796478 files
- after more aggressive exclusions + mode:"fast": about 228794 files

Even at the smaller count, the process still crashes in parallel.extract.

## Representative log

```
level=info msg=parallel.extract.file.done pos=547 elapsed_ms=5895 defs=141 path=art/compiler/optimizing/load_store_elimination_test.cc
level=info msg=parallel.extract.progress done=820 total=228794
level=info msg=parallel.extract.progress done=830 total=228794
level=info msg=parallel.extract.file.done pos=495 elapsed_ms=7780 defs=196 path=kernel-6.1/drivers/net/wireless/ralink/rt2x00/rt2800lib.c
level=info msg=parallel.extract.progress done=840 total=228794
level=info msg=parallel.extract.progress done=850 total=228794
level=info msg=parallel.extract.progress done=860 total=228794
level=info msg=parallel.extract.progress done=870 total=228794
level=info msg=parallel.extract.file.done pos=621 elapsed_ms=5011 defs=387 path=art/compiler/optimizing/code_generator_arm64.cc
level=info msg=parallel.extract.progress done=880 total=228794
level=info msg=parallel.extract.progress done=890 total=228794
level=info msg=parallel.extract.progress done=900 total=228794
level=info msg=parallel.extract.progress done=910 total=228794
level=info msg=parallel.extract.progress done=920 total=228794
level=info msg=parallel.extract.progress done=930 total=228794
level=info msg=parallel.extract.progress done=940 total=228794
level=info msg=parallel.extract.progress done=950 total=228794
level=info msg=parallel.extract.progress done=960 total=228794
level=info msg=parallel.extract.progress done=990 total=228794
level=info msg=parallel.extract.progress done=970 total=228794
level=info msg=parallel.extract.progress done=1000 total=228794
level=info msg=parallel.extract.progress done=1010 total=228794
Segmentation fault (core dumped)
```

## Additional observations

- The last files printed before the crash vary between runs.
- A miniature repo built from a few “last logged” files indexed successfully.
- Shard indexing works reliably when the large repo is split into smaller subtrees.

This makes it look more like a scalability / concurrency / cumulative-memory problem than a problem with any single bad file.
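
For reference, the shard runs that do complete are just separate per-subtree invocations in the same style as the reproduction commands above; the subtree names here are illustrative:

```
codebase-memory-mcp cli index_repository '{"repo_path":"/absolute/path/to/large/aosp-style/repo/art"}'
codebase-memory-mcp cli index_repository '{"repo_path":"/absolute/path/to/large/aosp-style/repo/external"}'
```

Whether those per-shard indexes can then be combined into a single index is an open question.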

## Initial source-level observations

These seem relevant from the current source tree:

- initial indexing uses all cores: cbm_default_worker_count(true)
- large repos almost always use parallel extraction (MIN_FILES_FOR_PARALLEL = 50)
- memory budget appears to be logged, not enforced
- worker threads use fixed 8 MB stacks (see the sketch after this list)
- result_cache retains extracted per-file results across extraction / registry / resolve
- CBM_EXTRACT_BUDGET is a parse timeout, not a memory bound
- fast mode still uses the same parallel extraction path
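
On the stack size point, fixed worker stacks are typically configured along these lines. This is a hedged reconstruction for illustration only, not the actual extraction code; `extract_worker` and `spawn_extract_worker` are placeholder names:

```c
#include <pthread.h>
#include <stddef.h>

/* Placeholder worker body; the real extraction worker is different. */
static void *extract_worker(void *arg) {
    (void)arg;
    return NULL;
}

/* Hedged reconstruction of the "fixed 8 MB stacks" observation. A
 * sufficiently deep parse recursion on one pathological file could
 * overflow a stack this size and surface as SIGSEGV. */
static int spawn_extract_worker(pthread_t *tid) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, (size_t)8 * 1024 * 1024);
    int rc = pthread_create(tid, &attr, extract_worker, NULL);
    pthread_attr_destroy(&attr);
    return rc;
}
```

A stack overflow in one worker would explain the SIGSEGV, although the run-to-run variation in the last logged files fits a cumulative-memory explanation at least as well.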

## Request

Is this a known scalability limitation in the current dev build?

The most promising fixes seem to be:

1. add a user-facing max_workers option to index_repository
2. enforce memory-based backpressure instead of only logging the budget (see the sketch after this list)
3. batch or stream result_cache instead of retaining results for the whole repo
4. provide a safer fallback for huge repositories
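
For fix 2, a minimal sketch of what enforced backpressure could look like, assuming a Linux-only /proc-based RSS probe; every name below is hypothetical and none of it exists in the current source tree:

```c
#include <stdio.h>
#include <unistd.h>

/* Resident set size in bytes, read from /proc/self/statm (Linux-only). */
static size_t rss_bytes(void) {
    long total = 0, resident = 0;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f == NULL)
        return 0;
    if (fscanf(f, "%ld %ld", &total, &resident) != 2)
        resident = 0;
    fclose(f);
    return (size_t)resident * (size_t)sysconf(_SC_PAGESIZE);
}

/* A worker would call this before claiming the next file: block until
 * RSS drops back under the budget (e.g. after a result_cache flush),
 * instead of merely logging that the budget was exceeded. Polling keeps
 * the sketch short; a condition variable signalled on each flush would
 * avoid the sleep loop. */
static void wait_for_memory_budget(size_t budget_bytes) {
    while (budget_bytes > 0 && rss_bytes() > budget_bytes)
        usleep(50 * 1000); /* re-check every 50 ms */
}
```

This pairs naturally with fix 3: a batched flush of result_cache would be the event that frees memory and lets blocked workers proceed.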
