perf(health): incremental duplication pair splice for update runs #460
Merged
A one-file update re-ran the full duplication pipeline (2.3M windows at PowerToys scale) because duplication_pct is repo-wide. Raw clone pairs are a pure function of the gate-surviving window set, and each window pair lives in exactly one hash bucket, so pairs between unchanged files cannot change.
Persist the raw-pair multiset (count-compressed) in a versioned artifact next to the duplication token cache and, on incremental runs, re-verify only buckets touched by changed or deleted files: subtract each touched bucket's old contribution, add its new one, and keep everything else verbatim. The degenerate-bucket cap applies to each touched bucket's full membership, so cap transitions in either direction stay exact. Finalize (merge, min-lines filter, co-change weighting) always runs live against the current git metadata.
Any validity miss falls back to the full pipeline, which rewrites the artifact: window/limits mismatch, truncated or timed-out persisted state, missing token-cache entries, multiset accounting mismatch, or more than max(16, 20% of files) moved.
PowerToys (5,481 parsed files), isolated: warm full detect_clones 12.7s to 3.4s incremental, report equal pair-for-pair (multiset including token_count) and pct-for-pct against a fresh full recompute, for both a real modification and a no-op change.
✅ Health: 7.6 (unchanged) 🚨 Change risk: high (riskier than 76% of this repo's commits · raw 9.5/10)
🩹 Review priority (files here with the most recent bug-fix history — defects cluster, so review these first)
🔥 Hotspots touched (3)
🔗 Hidden coupling (1 file)
Last updated 2026-06-12 05:58 UTC
swati510
approved these changes
Jun 12, 2026
Problem
A one-file `repowise update` re-runs the full duplication pipeline because `duplication_pct` is repo-wide: on a PowerToys-scale repo that means rebuilding 2.3M rolling-hash windows and re-verifying 1.25M candidate pairs to discover that almost nothing changed. The token cache (added earlier) removed the re-tokenize cost, but the windowing, bucketing, and verification stages still ran at full cost on every update, ~11-13s warm.
Approach
Raw clone pairs are a pure function of the gate-surviving window set, and each window pair lives in exactly one hash bucket, so pairs between unchanged files cannot change. This PR persists the raw-pair multiset in a versioned artifact (`duplication_pairs.pkl`) next to the duplication token cache, and `detect_clones` gains a `changed_files` parameter. On incremental runs it splices instead of recomputing: only buckets touched by changed or deleted files are re-verified, each touched bucket's old contribution is subtracted and its new one added, and everything else is kept verbatim. Pairs are counted as a multiset that includes `token_count`; identical raw pairs always merge together, which also lets the artifact store counts instead of repeated rows (1.25M pairs compress to 367k rows, 7.7MB on PowerToys).
The artifact also records which files the detector considered but gated out (minified, too small, over the token cap), so they are not mistaken for new files on every run, plus the total window count and guard flags.
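The subtract-old, add-new splice can be illustrated with a small sketch. This is not the PR's code: the window shape, the `BUCKET_CAP` value, the helper names, and the assumption that an over-cap bucket contributes no pairs are all hypothetical, and pair verification is stubbed out as "every pair in a bucket matches".

```python
from collections import Counter
from itertools import combinations

# Hypothetical shapes: a window is (file, start_line); a bucket maps a
# rolling-hash value to the windows that share it.
BUCKET_CAP = 4  # assumed degenerate-bucket cap, chosen only for the example


def bucket_pairs(members):
    """Raw pairs contributed by one bucket, judged on its full membership."""
    if len(members) > BUCKET_CAP:
        return Counter()  # assumption: a degenerate bucket contributes nothing
    return Counter(combinations(sorted(members), 2))


def full_pairs(buckets):
    """Full recompute: sum every bucket's contribution."""
    total = Counter()
    for members in buckets.values():
        total.update(bucket_pairs(members))
    return total


def splice(old_pairs, old_buckets, new_buckets, changed_files):
    """Update the persisted multiset by re-deriving only touched buckets."""
    touched = {
        h
        for buckets in (old_buckets, new_buckets)
        for h, members in buckets.items()
        if any(f in changed_files for f, _ in members)
    }
    result = Counter(old_pairs)
    for h in touched:
        result.subtract(bucket_pairs(old_buckets.get(h, [])))  # old contribution
        result.update(bucket_pairs(new_buckets.get(h, [])))    # new contribution
    if any(count < 0 for count in result.values()):
        # Accounting mismatch: a caller would fall back to the full pipeline.
        raise ValueError("splice accounting mismatch")
    return +result  # unary plus drops rows whose count reached zero
```

Because each bucket is re-derived from its full membership rather than patched pair by pair, a bucket that crosses the cap in either direction ends up with the same contribution a full recompute would give it.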
Any validity miss falls back to the full pipeline, which rewrites the artifact: window-size or limits mismatch, truncated or timed-out persisted state, missing token-cache entries, a splice accounting mismatch, or more than max(16, 20% of files) moved. The health engine passes the changed set through on both the sync and async paths; full runs (init) are unchanged apart from persisting the artifact.
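As a rough illustration of the validity gate, the pre-splice checks could be expressed as one function that returns the first reason to fall back. The `PairArtifactMeta` fields, the function name, and the argument names are hypothetical, and how "20% of files" is rounded is an assumption here; the splice accounting mismatch is not covered because it can only be detected during the splice itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PairArtifactMeta:
    """Hypothetical header stored alongside the persisted pair multiset."""
    window_size: int
    limits: tuple
    truncated: bool
    timed_out: bool


def fallback_reason(meta, *, window_size, limits, unchanged_files,
                    cached_files, moved, total_files) -> Optional[str]:
    """Return why the full pipeline must run, or None if a splice is allowed."""
    if meta.window_size != window_size or meta.limits != limits:
        return "window/limits mismatch"
    if meta.truncated or meta.timed_out:
        return "truncated or timed-out persisted state"
    if not unchanged_files <= cached_files:
        # Unchanged files are read from the token cache, so each needs an entry.
        return "missing token-cache entries"
    if moved > max(16, total_files / 5):  # more than max(16, 20% of files)
        return "too many files moved"
    return None
```

With 5,481 files the 20% term dominates (1,096.2), so under this reading up to 1,096 moved files would still splice; on small repos the floor of 16 applies instead.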
Equivalence
`tests/unit/health/test_duplication_incremental.py` holds 18 oracle tests asserting the incremental report equals a fresh full recompute, pair-for-pair as a multiset including `token_count`, plus `duplication_pct` and `pairs_by_file`: modify/add/delete/rename, no-op rewrites, intra-file clones, degenerate-cap transitions in both directions, chained splices, gated files staying gated and growing into participants, and every fallback path (too many changes, limits change, missing/corrupt artifact, truncated state). Token-cache retention across splice saves and live co-change weighting are covered too.
At scale: an A/B probe on PowerToys (5,481 parsed files) appends a real clone to a source file and verifies the spliced report equals a fresh full recompute exactly, for both the modification and a no-op change.
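To make "pair-for-pair as a multiset including `token_count`" concrete, here is a minimal sketch of such a comparison together with count compression. The `RawPair` fields and the helper names are hypothetical and are not taken from the test file.

```python
from collections import Counter
from typing import NamedTuple


class RawPair(NamedTuple):
    """Hypothetical raw clone pair; the real report's fields may differ."""
    file_a: str
    start_a: int
    file_b: str
    start_b: int
    token_count: int


def same_multiset(incremental, full):
    """Order-insensitive equality that still respects duplicate pairs."""
    return Counter(incremental) == Counter(full)


def compress(pairs):
    """Store each distinct pair once with its multiplicity."""
    return sorted(Counter(pairs).items())


def expand(rows):
    """Inverse of compress: repeat each pair `count` times."""
    return [pair for pair, count in rows for _ in range(count)]
```

Comparing as a multiset rather than a set matters because duplicate raw pairs are expected; a set comparison would hide a splice that dropped or double-counted one of them.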
Numbers (1 run per config)
[Table: `detect_clones` warm, isolated (PowerToys); `update` wall; `update` wall]
Full unit suite: 5,015 passed.