perf: XL-repo indexing pass (cpp hints regex, git deep walk, xaml index reuse) #459
Merged
The cpp dynamic-hints extractor's function-definition regex backtracks
catastrophically on long runs of prefix-class characters (large
initializer lists, expression-template headers): 38.6s of a 39.8s
extract on a WinUI-scale monorepo, 30s of it in three files.
No component of the pattern can match across ';', '{' or '}'. The one
exception, a noexcept argument span containing one of those characters
before its first closing parenthesis, is detected per file and falls
back to the full-text scan. Every match therefore lies between a
delimiter and the next open brace, so the unchanged pattern is searched
window-by-window via search(pos, endpos), which preserves anchor
semantics against the full string. Windows without a parenthesis cannot
satisfy the mandatory argument span and are skipped.
Matches are byte-identical across a 1,295-file C++ validation corpus
(20,148 matches, zero diffs); extractor edges unchanged. Isolated:
cpp extract 39.8s to 2.9s, dynamic hints extract_all 47.2s to 12.4s.
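
A minimal sketch of the delimiter-windowed scan described above, in Python. The pattern `FUNC_DEF` is a hypothetical, much simplified stand-in for the extractor's real function-definition regex, and `windowed_matches` is an illustrative name, not the extractor's API. It assumes, as the commit message states, that no match can cross ';', '{' or '}' and that every match ends at an open brace; the noexcept fallback to a full-text scan is not shown.

```python
import re

# Hypothetical, simplified stand-in for the function-definition pattern.
# The real pattern is not shown in this PR text.
FUNC_DEF = re.compile(
    r"\b([A-Za-z_]\w*)\s*\([^;{}]*\)\s*(?:const\s*)?(?:noexcept\s*)?\{"
)
DELIMITER = re.compile(r"[;{}]")


def windowed_matches(pattern, text):
    """Search `pattern` only in windows that end at an open brace.

    Each window runs from just after the previous delimiter to just after
    the next '{'. The text is never sliced: pattern.search(text, pos, endpos)
    is used so that '^' and '\\b' at the window start are still evaluated
    against the full string.
    """
    matches = []
    start = 0
    for delim in DELIMITER.finditer(text):
        if delim.group() == "{":
            end = delim.end()
            # A window without '(' cannot hold the mandatory argument span.
            if text.find("(", start, end) != -1:
                found = pattern.search(text, start, end)
                if found is not None:
                    matches.append(found)
        start = delim.end()
    return matches
```

Comparing `[m.span() for m in windowed_matches(FUNC_DEF, src)]` with `[m.span() for m in FUNC_DEF.finditer(src)]` on a corpus is one way to reproduce the zero-diff check reported above.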
Files absent from the recent-window commit index each spawned a per-file
'git log --numstat' subprocess fallback. On repos whose history is much
deeper than the 500-commit window that fallback dominates the git phase:
a 9k-commit WinUI monorepo left 3,295 of 4,857 indexable files to it,
costing ~40s of the ESSENTIAL tier.

When the window misses at least 100 files, one additional
'git log --skip=500 --numstat --no-merges' walk over the older history
region now buckets their commits (newest first, capped at the per-file
limit), and per-file indexing reads from that bucket. --skip counts
after --no-merges filtering exactly like the window walk's -N cap, so
the two regions partition the history. Files the deep walk still misses
keep the per-file fallback, and repos below the threshold are
completely unchanged.

Deep-bucketed files move from pathspec diff semantics to repo-walk diff
semantics, which window-indexed files always had: rename rows attribute
edit churn through the rename instead of showing a whole-file addition.
Quantified on the monorepo: 937 of 4,857 files differ, 929 of them only
in temporal_hotspot_score; hotspot flags unchanged. Git ESSENTIAL 43.2s
to 12.9s; zero per-file fallbacks remain there.
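
The bucketing step can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's code: it assumes the log was produced with a format that prints one commit hash per header line (for example 'git log --skip=500 --numstat --no-merges --format=%H'), that numstat rows are tab-separated 'added, deleted, path', and the names `bucket_commits`, `wanted` and `per_file_limit` are hypothetical. Rename markers are left unresolved here.

```python
from collections import defaultdict


def bucket_commits(log_lines, wanted, per_file_limit):
    """Group numstat rows by path for the files the window index missed.

    `log_lines` is the log output, newest commit first. Only paths in
    `wanted` are bucketed, and each bucket stops growing once it holds
    `per_file_limit` commits, so the newest commits are the ones kept.
    """
    buckets = defaultdict(list)
    current_commit = None
    for raw in log_lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # blank separator between header and numstat rows
        parts = line.split("\t")
        if len(parts) == 3:
            added, deleted, path = parts
            if (
                current_commit is not None
                and path in wanted
                and len(buckets[path]) < per_file_limit
            ):
                buckets[path].append((current_commit, added, deleted))
        else:
            # Assumption: any non-numstat line is a commit hash header.
            current_commit = line.strip()
    return dict(buckets)
```

One pass over a single subprocess's output replaces one subprocess per missed file, which is where the saving described above comes from.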
Git emits directory insertions and removals as rename markers with an
empty side: 'src/{ => newdir}/file.cs'. The marker regex required at
least one character on each side, so those lines never resolved to a
tracked path and the commit was silently dropped from the file's
history in both the window and deep commit indexes.
Allow empty sides and collapse the doubled slash the empty expansion
leaves at the splice point. On a 9k-commit WinUI monorepo this
recovers the missing commits: window-index coverage 1,562 to 1,564
files, deep-walk coverage 3,161 to 3,293 of 3,293 missed files, and
the deep-vs-fallback structural metadata diffs collapse
(first_commit_at 91 to 26 files, owner flips 12 to 4).
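
A short Python sketch of the marker expansion with empty sides allowed. The regex and the helper names `expand_rename` and `_splice` are hypothetical, written only to illustrate the fix described above; the project's actual pattern is not shown in this PR text.

```python
import re

# Either side of ' => ' may be empty; a stricter pattern requiring at
# least one character per side would reject 'src/{ => newdir}/file.cs'.
BRACE_RENAME = re.compile(r"\{([^{}]*) => ([^{}]*)\}")


def _splice(prefix, middle, suffix):
    # An empty side leaves 'src/' + '' + '/file.cs'; drop the doubled slash
    # at the splice point only.
    if not middle and prefix.endswith("/") and suffix.startswith("/"):
        suffix = suffix[1:]
    return prefix + middle + suffix


def expand_rename(path):
    """Return (old_path, new_path) for a numstat path field."""
    found = BRACE_RENAME.search(path)
    if found is None:
        if " => " in path:
            old, new = path.split(" => ", 1)
            return old, new
        return path, path  # not a rename row
    prefix, suffix = path[: found.start()], path[found.end():]
    return (
        _splice(prefix, found.group(1), suffix),
        _splice(prefix, found.group(2), suffix),
    )
```

For the example in the commit message, `expand_rename("src/{ => newdir}/file.cs")` yields the pair `("src/file.cs", "src/newdir/file.cs")`.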
The XAML dynamic-hints extractor needs the repo's C# type map and
rebuilt the full DotNetProjectIndex from scratch to get it: a second
complete .cs walk, read, and declaration scan on every init and every
update, duplicating the index the graph resolvers had already built
moments earlier (~2-3s on a 3,400-file C# monorepo).

GraphBuilder.build() now stashes the resolver-built index (only when
nested-git pruning matches the standalone build's behaviour, so the
maps are guaranteed identical), and both pipeline call sites pass it
through HintRegistry.extract_all to the extractor fleet the same way
the shared walk snapshot travels. The XAML extractor uses the provided
index and falls back to building its own when none is attached, so
standalone use is unchanged. Emitted edges are identical with and
without the prebuilt index; extract_all on the monorepo drops from 9.6s
to 7.2s.
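
The reuse-with-fallback shape can be sketched in Python as below. Everything here is a hypothetical stand-in: `build_type_map` plays the role of the expensive index build, `extract_xaml_edges` the role of the extractor, and the call counter exists only to make the saved rebuild visible. None of these names or signatures come from the project.

```python
import re

CLASS_DECL = re.compile(r"\bclass\s+(\w+)")


def build_type_map(sources):
    """Stand-in for the full .cs walk, read, and declaration scan."""
    build_type_map.calls += 1
    return {
        name: path
        for path, text in sources.items()
        for name in CLASS_DECL.findall(text)
    }


build_type_map.calls = 0  # counts how often the expensive build runs


def extract_xaml_edges(xaml_classes, sources, prebuilt=None):
    """Resolve each XAML file's class name to the .cs file declaring it.

    Uses `prebuilt` when the caller supplies one and builds its own map
    otherwise, so standalone callers need no changes.
    """
    type_map = prebuilt if prebuilt is not None else build_type_map(sources)
    edges = []
    for xaml_path, class_name in xaml_classes.items():
        target = type_map.get(class_name)
        if target is not None:
            edges.append((xaml_path, target))
    return edges
```

Checking `prebuilt is not None` rather than truthiness matters in this sketch: an empty but valid map should still be reused instead of triggering a rebuild.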
✅ Health: 7.7 (unchanged) 🚨 Change risk: high (riskier than 71% of this repo's commits · raw 9.4/10)
🩹 Review priority (files here with the most recent bug-fix history — defects cluster, so review these first)
🔥 Hotspots touched (5)
🔗 Hidden coupling (3 files)
💀 Dead code (3 findings)
Last updated 2026-06-12 04:29 UTC
swati510
approved these changes
Jun 12, 2026
Profiles the indexing pipeline on a WinUI-scale monorepo (PowerToys: 7,870 tracked files, 3,436 C#, ~1,300 C++/headers, 9,139 commits) and fixes what it surfaced. Wall clock on that repo: init --index-only 357.6s to 250.4s, update 161.8s to 138.1s.

Changes
C++ dynamic hints: delimiter-windowed function-definition scan. The extractor's function-definition regex backtracks catastrophically on long runs of identifiers and whitespace (large initializer lists, expression-template headers): 38.6s of a 39.8s extract, with 30s concentrated in three files. No component of the pattern can match across ';', '{', or '}' (the one exception, noexcept argument spans, is detected per file and falls back to the full scan), so the unchanged pattern is now searched window-by-window with search(pos, endpos), which preserves anchor semantics against the full string. Matches are byte-identical across the 1,295-file validation corpus (20,148 matches); extract drops to 2.9s and runs on every init and update.

Git: batch deep-history file indexing into one --skip log walk. Files absent from the recent-window commit index each spawned a per-file 'git log --numstat' subprocess; on this repo 3,295 of 4,857 indexable files missed the window, costing ~40s. When at least 100 files miss, one 'git log --skip=500' walk over the older history region buckets their commits instead; files it still misses keep the per-file path, and repos below the threshold are completely unchanged. ESSENTIAL-tier git indexing drops from 43.2s to 12.9s with zero per-file fallbacks remaining. Deep-bucketed files move from pathspec diff semantics to repo-walk diff semantics, which window-indexed files always had; quantified in the commit message (937 of 4,857 files differ, 929 only in the decayed churn score, hotspot flags unchanged).

Git: expand empty-side numstat rename markers. Pre-existing bug: directory insertions and removals ('src/{ => newdir}/file.cs') never matched the rename-marker regex, silently dropping those commits from file history in both commit-index walks. Recovers the missing history and collapses the structural metadata differences between the batched and per-file paths (first-commit dates, owners).

XAML hints: reuse the resolver-built .NET index. The XAML extractor rebuilt the full DotNetProjectIndex (a second complete .cs walk, read, and scan) on every init and update to get the type map the graph resolvers had already built. The graph builder now stashes its index and the pipeline passes it through the hint registry; emitted edges are identical and the hints phase drops another ~2.4s.

Validation
Changed files: 1); hotspot counts unchanged across the git change.