
perf: XL-repo indexing pass (cpp hints regex, git deep walk, xaml index reuse) #459

Merged: RaghavChamadiya merged 4 commits into main from perf/indexing-xl-repos on Jun 12, 2026

@RaghavChamadiya


Profiles the indexing pipeline on a WinUI-scale monorepo (PowerToys: 7,870 tracked files, 3,436 C#, ~1,300 C++/headers, 9,139 commits) and fixes what it surfaced. Wall clock on that repo: init --index-only 357.6s to 250.4s, update 161.8s to 138.1s.

Changes

C++ dynamic hints: delimiter-windowed function-definition scan. The extractor's function-definition regex backtracks catastrophically on long runs of identifiers and whitespace (large initializer lists, expression-template headers): 38.6s of a 39.8s extract, with 30s concentrated in three files. No component of the pattern can match across ;, {, or } (the one exception, noexcept argument spans, is detected per file and falls back to the full scan), so the unchanged pattern is now searched window-by-window with search(pos, endpos), which preserves anchor semantics against the full string. Matches are byte-identical across the 1,295-file validation corpus (20,148 matches). The extract drops from 39.8s to 2.9s, and because it runs on every init and update, both commands benefit.

Git: batch deep-history file indexing into one --skip log walk. Files absent from the recent-window commit index each spawned a per-file git log --numstat subprocess; on this repo 3,295 of 4,857 indexable files missed the window, costing ~40s. When at least 100 files miss, one git log --skip=500 walk over the older history region buckets their commits instead; files it still misses keep the per-file path, and repos below the threshold are completely unchanged. ESSENTIAL-tier git indexing drops from 43.2s to 12.9s with zero per-file fallbacks remaining. Deep-bucketed files move from pathspec diff semantics to repo-walk diff semantics, which window-indexed files always had; quantified in the commit message (937 of 4,857 files differ, 929 only in the decayed churn score, hotspot flags unchanged).

Git: expand empty-side numstat rename markers. Pre-existing bug: directory insertions and removals (src/{ => newdir}/file.cs) never matched the rename-marker regex, silently dropping those commits from file history in both commit-index walks. Recovers the missing history and collapses the structural metadata differences between the batched and per-file paths (first-commit dates, owners).

XAML hints: reuse the resolver-built .NET index. The XAML extractor rebuilt the full DotNetProjectIndex (a second complete .cs walk, read, and scan) on every init and update to get the type map the graph resolvers had already built. The graph builder now stashes its index and the pipeline passes it through the hint registry; emitted edges are identical and the hints phase drops another ~2.4s.

Validation

  • Every change carries an equivalence or regression test: a windowed-vs-full-scan oracle corpus for the regex, a per-file-fallback oracle plus zero-subprocess assertion for the deep walk, all rename-marker forms, and a no-rebuild assertion plus edge-set equality for the index reuse.
  • Full unit suite green (4,941 passed).
  • Update verified on the incremental path (log shows Changed files: 1); hotspot counts unchanged across the git change.

The cpp dynamic-hints extractor's function-definition regex backtracks
catastrophically on long runs of prefix-class characters (large
initializer lists, expression-template headers): 38.6s of a 39.8s
extract on a WinUI-scale monorepo, 30s of it in three files.

No component of the pattern can match across ';', '{' or '}'. The one
exception, a noexcept argument span containing one of those characters
before its first closing parenthesis, is detected per file and falls
back to the full-text scan. Every match therefore lies between a
delimiter and the next open brace, so the unchanged pattern is searched
window-by-window via search(pos, endpos), which preserves anchor
semantics against the full string. Windows without a parenthesis cannot
satisfy the mandatory argument span and are skipped.
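As a rough illustration of the windowing idea, the sketch below compares a full-text scan with a window-by-window scan. The pattern here is a hypothetical, heavily simplified stand-in for the extractor's real function-definition regex, and the helper names (`full_scan`, `windowed_scan`) are made up for this example; only `Pattern.search(string, pos, endpos)` is the actual standard-library call the commit message refers to. The noexcept fallback is not modelled.

```python
import re

# Hypothetical, simplified stand-in for the extractor's pattern:
# name, parenthesised argument span, optional const, opening brace.
FUNC_DEF = re.compile(r"\b([A-Za-z_]\w*)\s*\(([^;{}()]*)\)\s*(?:const\s*)?\{")
DELIMITER = re.compile(r"[;{}]")


def full_scan(text):
    """Baseline: run the pattern over the whole text."""
    return [(m.start(), m.group(1)) for m in FUNC_DEF.finditer(text)]


def windowed_scan(text):
    """Search only the windows that end at an opening brace."""
    found = []
    window_start = 0
    for delim in DELIMITER.finditer(text):
        window_end = delim.end()
        # Only a window closed by '{' can hold a definition, and it needs
        # a '(' somewhere to satisfy the mandatory argument span.
        has_paren = text.find("(", window_start, window_end) != -1
        if delim.group() == "{" and has_paren:
            # pos/endpos bound the search without slicing the string.
            m = FUNC_DEF.search(text, window_start, window_end)
            if m is not None:
                found.append((m.start(), m.group(1)))
        window_start = window_end
    return found


SAMPLE = """\
int add(int a, int b) { return a + b; }
struct S { void run() const { work(); } };
int x = compute(1, 2);
"""

# The two scans agree on this sample, which is the equivalence the
# validation corpus checks at scale.
assert windowed_scan(SAMPLE) == full_scan(SAMPLE)
print([name for _, name in windowed_scan(SAMPLE)])  # ['add', 'run']
```

Because each search is confined to a short window, a long run of identifiers and whitespace in one window cannot make the regex engine retry across the rest of the file.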

Matches are byte-identical across a 1,295-file C++ validation corpus
(20,148 matches, zero diffs); extractor edges unchanged. Isolated:
cpp extract 39.8s to 2.9s, dynamic hints extract_all 47.2s to 12.4s.

Files absent from the recent-window commit index each spawned a
per-file 'git log --numstat' subprocess fallback. On repos whose
history is much deeper than the 500-commit window that fallback
dominates the git phase: a 9k-commit WinUI monorepo left 3,295 of
4,857 indexable files to it, costing ~40s of the ESSENTIAL tier.

When the window misses at least 100 files, one additional
'git log --skip=500 --numstat --no-merges' walk over the older history
region now buckets their commits (newest first, capped at the per-file
limit), and per-file indexing reads from that bucket. --skip counts
after --no-merges filtering exactly like the window walk's -N cap, so
the two regions partition the history. Files the deep walk still
misses keep the per-file fallback, and repos below the threshold are
completely unchanged.
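The bucketing step can be sketched roughly as follows. This is not the project's code: the function names, the `@@` commit-line prefix, and the returned tuple shape are assumptions made for the example. It assumes the log text was produced by a command along the lines of `git log --skip=500 --no-merges --numstat --format=@@%H`, and it ignores rename rows for brevity.

```python
DEEP_WALK_THRESHOLD = 100  # "at least 100 files" per the description


def should_batch(missed_paths):
    """Decide whether one deep walk is worth it (hypothetical helper)."""
    return len(missed_paths) >= DEEP_WALK_THRESHOLD


def bucket_deep_history(log_text, missed_paths, per_file_limit):
    """Group numstat rows by path, newest first, capped per file.

    Each commit is assumed to start with an '@@<sha>' line; that prefix
    is this sketch's own convention, set through --format.
    """
    buckets = {}
    sha = None
    for line in log_text.splitlines():
        if line.startswith("@@"):
            sha = line[2:]
            continue
        fields = line.split("\t")
        if sha is None or len(fields) != 3:
            continue  # blank separator or unrelated line
        added, deleted, path = fields
        if path not in missed_paths:
            continue
        if len(buckets.get(path, ())) >= per_file_limit:
            continue  # git log is newest first, so older rows are dropped
        # Binary files report '-' for both counts.
        churn = sum(int(n) for n in (added, deleted) if n.isdigit())
        buckets.setdefault(path, []).append((sha, churn))
    # Files the walk never saw would keep the per-file fallback.
    still_missing = set(missed_paths) - set(buckets)
    return buckets, still_missing


LOG = (
    "@@c3\n\n2\t1\tsrc/a.cpp\n5\t0\tsrc/b.cs\n"
    "@@c2\n\n-\t-\tassets/logo.png\n1\t1\tsrc/a.cpp\n"
    "@@c1\n\n4\t4\tsrc/a.cpp\n"
)
missed = {"src/a.cpp", "assets/logo.png", "src/gone.h"}
buckets, still_missing = bucket_deep_history(LOG, missed, per_file_limit=2)
print(buckets["src/a.cpp"])  # [('c3', 3), ('c2', 2)]
print(still_missing)         # {'src/gone.h'}
```

One pass over the log output replaces one subprocess per missed file, which is where the saving in the git phase would come from.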

Deep-bucketed files move from pathspec diff semantics to repo-walk
diff semantics, which window-indexed files always had: rename rows
attribute edit churn through the rename instead of showing a
whole-file addition. Quantified on the monorepo: 937 of 4,857 files
differ, 929 of them only in temporal_hotspot_score; hotspot flags
unchanged. Git ESSENTIAL 43.2s to 12.9s; zero per-file fallbacks
remain there.

Git emits directory insertions and removals as rename markers with an
empty side: 'src/{ => newdir}/file.cs'. The marker regex required at
least one character on each side, so those lines never resolved to a
tracked path and the commit was silently dropped from the file's
history in both the window and deep commit indexes.
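A marker pattern that tolerates an empty side, with the doubled slash collapsed after expansion, might look like the sketch below. The regex and the `expand_rename` helper are illustrative assumptions, not the project's actual marker regex.

```python
import re

# '*' rather than '+' lets either side of ' => ' be empty.
BRACE_MARKER = re.compile(r"\{(?P<old>[^{}]*) => (?P<new>[^{}]*)\}")


def expand_rename(numstat_path):
    """Return (old_path, new_path) for a numstat path field."""
    m = BRACE_MARKER.search(numstat_path)
    if m is None:
        if " => " in numstat_path:  # whole-path rename: 'a.py => b.py'
            old, new = numstat_path.split(" => ", 1)
            return old, new
        return numstat_path, numstat_path  # not a rename
    prefix, suffix = numstat_path[: m.start()], numstat_path[m.end():]
    # An empty side leaves 'src//file.cs' at the splice point.
    old = (prefix + m.group("old") + suffix).replace("//", "/")
    new = (prefix + m.group("new") + suffix).replace("//", "/")
    return old, new


print(expand_rename("src/{ => newdir}/file.cs"))
# ('src/file.cs', 'src/newdir/file.cs')
print(expand_rename("src/{olddir => }/file.cs"))
# ('src/olddir/file.cs', 'src/file.cs')
```

With a pattern that demands at least one character per side, neither of the two calls above would match, so the row would not resolve to a tracked path.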

Allow empty sides and collapse the doubled slash the empty expansion
leaves at the splice point. On a 9k-commit WinUI monorepo this
recovers the missing commits: window-index coverage 1,562 to 1,564
files, deep-walk coverage 3,161 to 3,293 of 3,293 missed files, and
the deep-vs-fallback structural metadata diffs collapse
(first_commit_at 91 to 26 files, owner flips 12 to 4).

The XAML dynamic-hints extractor needs the repo's C# type map and
rebuilt the full DotNetProjectIndex from scratch to get it: a second
complete .cs walk, read, and declaration scan on every init and every
update, duplicating the index the graph resolvers had already built
moments earlier (~2-3s on a 3,400-file C# monorepo).

GraphBuilder.build() now stashes the resolver-built index (only when
nested-git pruning matches the standalone build's behaviour, so the
maps are guaranteed identical), and both pipeline call sites pass it
through HintRegistry.extract_all to the extractor fleet the same way
the shared walk snapshot travels. The XAML extractor uses the provided
index and falls back to building its own when none is attached, so
standalone use is unchanged.
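The reuse-with-fallback shape can be sketched in a few lines. `TypeIndex` and `XamlExtractor` below are hypothetical stand-ins, not the real DotNetProjectIndex or extractor classes, and the build counter exists only to make the "no rebuild" property visible.

```python
class TypeIndex:
    """Hypothetical stand-in for the resolver-built .NET index."""

    builds = 0  # counts how many times the expensive scan ran

    def __init__(self, type_map):
        self.type_map = type_map

    @classmethod
    def build(cls, sources):
        cls.builds += 1
        # Stand-in for the full .cs walk, read, and declaration scan.
        type_map = {
            name: path for path, names in sources.items() for name in names
        }
        return cls(type_map)


class XamlExtractor:
    """Uses a provided index, or builds its own when none is attached."""

    def __init__(self, sources, prebuilt_index=None):
        if prebuilt_index is not None:
            self.index = prebuilt_index
        else:
            self.index = TypeIndex.build(sources)

    def resolve(self, type_name):
        return self.index.type_map.get(type_name)


sources = {
    "src/MainPage.xaml.cs": ["MainPage"],
    "src/Settings.cs": ["SettingsViewModel"],
}
shared = TypeIndex.build(sources)                   # graph-build phase
reused = XamlExtractor(sources, prebuilt_index=shared)
assert TypeIndex.builds == 1                        # no second scan
standalone = XamlExtractor(sources)                 # fallback path
assert TypeIndex.builds == 2
assert reused.resolve("MainPage") == standalone.resolve("MainPage")
```

Keeping the fallback inside the extractor means callers that never run the graph build do not need to change.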

Emitted edges are identical with and without the prebuilt index;
extract_all on the monorepo drops from 9.6s to 7.2s.

@repowise-bot

repowise-bot Bot commented Jun 12, 2026


✅ Health: 7.7 (unchanged)
2 files moved · 5 hotspots · 5 hidden couplings · 5 with fix history · 3 dead-code findings

🚨 Change risk: high (riskier than 71% of this repo's commits · raw 9.4/10)
This change's risk is driven by:

  • large diff (many lines added)
  • scattered, high-entropy change

🩹 Review priority (files here with the most recent bug-fix history — defects cluster, so review these first)

| File | Score | Δ | Why |
|---|---|---|---|
| .../ingestion/git_commit_index.py | 3.9 → 3.0 | ▼ -1.0 | 🔻 introduced nested complexity, large method, complex method · ✅ resolved function hotspot |
| .../git_indexer/indexer.py | 2.8 → 2.6 | ▼ -0.2 | ✅ resolved function hotspot, dry violation |

💡 .../ingestion/git_commit_index.py: Flatten the control flow. Pull early-return guards to the top, extract the deepest branch into a helper, and consider replacing nested conditionals with a strategy table or dispatch dict.

🔥 Hotspots touched (5)
  • .../ingestion/test_xaml_dynamic_hints.py — 4 commits/90d, 1 dependents · primary owner: Raghav Chamadiya (100%)
  • .../git_indexer/records.py — 5 commits/90d, 7 dependents · primary owner: Raghav Chamadiya (100%)
  • .../graph/builder.py — 11 commits/90d, 3 dependents · primary owner: Raghav Chamadiya (88%)
  • .../dynamic_hints/registry.py — 12 commits/90d, 4 dependents · primary owner: Raghav Chamadiya (95%)
  • .../pipeline/incremental.py — 3 commits/90d, 3 dependents · primary owner: Raghav Chamadiya (100%)
🔗 Hidden coupling (3 files)
  • .../graph/builder.py co-changes with these files (not in this PR):
    • .../ingestion/call_resolver.py (5× — 🟢 routine)
    • .../commands/update_cmd.py (4× — 🟢 routine)
  • .../dynamic_hints/registry.py co-changes with docs/LANGUAGE_SUPPORT.md (5× — 🟢 routine) — not in this PR.
  • .../git_indexer/indexer.py co-changes with these files (not in this PR):
    • .../git_indexer/file_history.py (5× — 🟢 routine)
    • .../persistence/models.py (5× — 🟢 routine)
💀 Dead code (3 findings)
  • 💀 .../pipeline/incremental.py _noop_log (confidence 0.65)
  • 💀 .../git_indexer/_constants.py _quiet_del (confidence 0.65)
  • 💀 .../phases/ingestion.py _parse_one (confidence 0.65)

📊 Full report · ⭐ Star Repowise · 📥 Install bot · Last updated 2026-06-12 04:29 UTC

@RaghavChamadiya RaghavChamadiya merged commit ec6a97b into main Jun 12, 2026
5 checks passed
@RaghavChamadiya RaghavChamadiya deleted the perf/indexing-xl-repos branch June 12, 2026 04:38

2 participants