Dev/ida streaming memory #2
Merged
Replace the source-text scan for cross-references with authoritative decompiler data, speed up the metadata phase, and surface its progress.

- Collect data cross-references from the backend (IDA xrefblk / r2 axtj) during analysis while the session is open, store them on ProgramAnalysis, and resolve each reference to its containing function at metadata time. Removes the O(variables x source-lines) text scan in strings.json and variables.json; xrefs are now real read/write refs.
- Add a stepped progress bar and per-step logging to the metadata phase, and copy the published database in chunks with a byte progress bar.
- Add a --entropy flag (off by default); per-section Shannon entropy is skipped unless requested, and the entropy histogram now counts at C speed. Skips the costly per-byte scan over large segments by default.
- Recover argument/local counts from the rendered summary so functions.json reports real nargs/nlocals without re-decompiling.
- Skip the IDA string-list rebuild when a reused database already has one, and drop dead disasm/decompile/summary caches and _prime.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
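As an illustration of the entropy item above, here is a minimal sketch of per-section Shannon entropy in which the byte histogram is built by `collections.Counter` rather than a Python-level per-byte loop. The function name `shannon_entropy` is hypothetical, and using `Counter` is an assumption about what "counts at C speed" could look like; the PR does not show the actual implementation.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0).

    Hypothetical helper: Counter(data) builds the byte histogram inside
    the interpreter's C code instead of a Python loop over every byte.
    """
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # maps byte value -> number of occurrences
    # H = sum over byte values of p * log2(1 / p), with p = n / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())
```

A section made of one repeated byte scores 0.0 and a section in which all 256 byte values are equally frequent scores 8.0, so high values tend to indicate packed or encrypted data.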
A requested `--jobs N` previously bypassed the memory budget. On a large database (e.g. a kernel) each worker loads the whole `.i64`, so N workers that cannot fit in RAM get OOM-killed mid-export; the streaming pool then breaks and falls back to the slow single-session path, discarding progress.

Apply the database-size-aware IDA memory ceiling to requested counts too, log a note when the count is reduced (with the TOCODE_IDA_WORKER_MEMORY_MB override), and reflect the cap in the worker summary line. Non-IDA backends still honor the requested count.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The per-worker memory ceiling was already derived at runtime from the host's available memory and the real database size, but two heuristic constants (base overhead and database resident factor) were hardcoded. Expose them as TOCODE_IDA_WORKER_BASE_MEMORY_MB and TOCODE_IDA_DB_RESIDENT_FACTOR (same defaults) and add coverage for tuning the model.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
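The worker cap described in the two commits above can be sketched roughly as follows. Only the three environment variable names come from the PR. The function names, the default values (512 MB base overhead, resident factor 1.5), the linear `base + factor * db_size` model, and the reading of TOCODE_IDA_WORKER_MEMORY_MB as a direct per-worker override are all assumptions made for illustration.

```python
import os

# Hypothetical defaults; the PR does not state the real values.
DEFAULT_BASE_MB = 512.0
DEFAULT_RESIDENT_FACTOR = 1.5


def _env_float(name, default, env):
    """Read a positive float from `env`, falling back to `default`."""
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def per_worker_mb(db_size_mb, env=os.environ):
    """Estimate the memory one IDA worker needs, in MB (assumed model)."""
    override = _env_float("TOCODE_IDA_WORKER_MEMORY_MB", 0.0, env)
    if override > 0:
        return override  # assumed: explicit per-worker figure wins
    base = _env_float("TOCODE_IDA_WORKER_BASE_MEMORY_MB", DEFAULT_BASE_MB, env)
    factor = _env_float("TOCODE_IDA_DB_RESIDENT_FACTOR", DEFAULT_RESIDENT_FACTOR, env)
    return base + factor * db_size_mb


def cap_workers(requested, available_mb, db_size_mb, env=os.environ):
    """Clamp a requested worker count to what fits in available memory."""
    ceiling = max(1, int(available_mb // per_worker_mb(db_size_mb, env)))
    return min(requested, ceiling)
```

Under these assumed defaults, a 2000 MB database costs 3512 MB per worker, so a request for 8 jobs on a host with 16000 MB available would be reduced to 4. At least one worker is always kept so the export can still proceed.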
Reliable, faster IDA exports for large databases (kernels).
- `.i64`: worker copies off tmpfs; worker count capped by real RAM + DB size (env-tunable).
- `--entropy` is opt-in; metadata/load/DB-copy phases now show progress.
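The DB-copy progress mentioned above (copying the database in chunks with a byte progress bar) might look something like this sketch. The function name `copy_with_progress`, the 8 MiB chunk size, and the `on_progress` callback are hypothetical; the PR does not show how the real copy or progress bar is wired up.

```python
import os


def copy_with_progress(src, dst, chunk_size=8 * 1024 * 1024, on_progress=None):
    """Copy `src` to `dst` in chunks, reporting bytes copied after each one.

    Hypothetical helper: `on_progress(copied, total)` is where a progress
    bar would be advanced. Returns the number of bytes copied.
    """
    total = os.path.getsize(src)
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            fout.write(chunk)
            copied += len(chunk)
            if on_progress is not None:
                on_progress(copied, total)
    return copied
```

Reading in fixed-size chunks keeps memory use bounded regardless of database size and gives the caller a natural point to update the display.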