perf(health): incremental duplication pair splice for update runs by RaghavChamadiya · Pull Request #460 · repowise-dev/repowise

RaghavChamadiya · 2026-06-12T05:57:56Z

Problem

A one-file repowise update re-runs the full duplication pipeline because duplication_pct is repo-wide: on a PowerToys-scale repo that means rebuilding 2.3M rolling-hash windows and re-verifying 1.25M candidate pairs to discover that almost nothing changed. The token cache (added earlier) removed the re-tokenize cost, but the windowing, bucketing, and verification stages still ran in full on every update, taking ~11-13s warm.
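For readers unfamiliar with the windowing and bucketing stages, here is a minimal sketch of the idea. It is not the detector's code: the polynomial rolling hash, the constants, and the names `rolling_hashes` and `bucket_windows` are assumptions made for illustration, and the verification stage is omitted.

```python
from collections import defaultdict

BASE = 257            # hypothetical hash base
MOD = (1 << 61) - 1   # hypothetical modulus (a Mersenne prime)


def rolling_hashes(token_ids, k):
    """Hash every length-k window of token_ids, updating in O(1) per step."""
    if len(token_ids) < k:
        return []
    high = pow(BASE, k - 1, MOD)  # weight of the token leaving the window
    h = 0
    for t in token_ids[:k]:
        h = (h * BASE + t) % MOD
    hashes = [h]
    for i in range(k, len(token_ids)):
        # Drop the oldest token, shift, and append the newest one.
        h = ((h - token_ids[i - k] * high) * BASE + token_ids[i]) % MOD
        hashes.append(h)
    return hashes


def bucket_windows(files, k):
    """Group (path, start) windows by hash; candidate pairs form inside a bucket."""
    buckets = defaultdict(list)
    for path, token_ids in files.items():
        for start, h in enumerate(rolling_hashes(token_ids, k)):
            buckets[h].append((path, start))
    return buckets


files = {"a": [1, 2, 3, 4], "b": [9, 2, 3, 4]}
buckets = bucket_windows(files, k=3)
# The shared window [2, 3, 4] lands in one bucket holding both files.
shared = [members for members in buckets.values() if len(members) > 1]
```

Because a window has exactly one hash, it belongs to exactly one bucket, which is the property the incremental approach relies on.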

Approach

Raw clone pairs are a pure function of the gate-surviving window set, and each window pair lives in exactly one hash bucket, so pairs between unchanged files cannot change. This PR persists the raw-pair multiset in a versioned artifact (duplication_pairs.pkl) next to the duplication token cache, and detect_clones gains a changed_files parameter. On incremental runs it splices instead of recomputing:

- Re-verify only buckets touched by a changed or deleted file (old windows come from the token cache by old content hash; new windows are computed live for the changed files only).
- For every touched bucket, subtract its old contribution and add its new one. Both are recomputed deterministically against the bucket's full membership, so the degenerate-bucket cap transitions correctly in either direction (a bucket shrinking below the cap revives its pairs; one growing past it drops them).
- The multiset is spliced with Counter arithmetic. Multiplicity matters because the merge stage accumulates token_count; identical raw pairs always merge together, which also lets the artifact store counts instead of repeated rows (1.25M pairs compress to 367k rows, 7.7MB on PowerToys).
- Finalize (merge, min-lines filter, co-change weighting) always runs live against the current git metadata, so co-change weights never go stale.
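The subtract-then-add splice described above can be sketched with `collections.Counter`. This is an illustration under assumptions, not the PR's implementation: `splice_pairs`, the tuple shape of a raw pair, and the per-bucket contribution dicts are hypothetical; only the use of Counter arithmetic over a raw-pair multiset comes from the description.

```python
from collections import Counter

# Assumed shape of a raw pair: (file_a, start_a, file_b, start_b).


def splice_pairs(persisted, old_by_bucket, new_by_bucket):
    """Return persisted - old + new, restricted to the touched buckets."""
    result = Counter(persisted)  # copy, so the persisted multiset is untouched
    for bucket in old_by_bucket.keys() | new_by_bucket.keys():
        old = old_by_bucket.get(bucket, Counter())
        for pair, count in old.items():
            # A pair we are asked to remove must exist with enough multiplicity.
            if result[pair] < count:
                raise ValueError("splice accounting mismatch")
        result.subtract(old)
        result.update(new_by_bucket.get(bucket, Counter()))
    return +result  # unary plus drops entries whose count fell to zero


persisted = Counter({("a.py", 0, "b.py", 10): 2, ("c.py", 5, "d.py", 5): 1})
old = {7: Counter({("a.py", 0, "b.py", 10): 2})}   # bucket 7 before the edit
new = {7: Counter({("a.py", 0, "b.py", 12): 1})}   # bucket 7 after the edit
spliced = splice_pairs(persisted, old, new)
```

In this sketch a caller would treat the `ValueError` as a validity miss and run the full pipeline instead. Pairs in untouched buckets, such as the `c.py`/`d.py` pair, pass through verbatim.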

The artifact also records which files the detector considered but gated out (minified, too small, over the token cap), so they are not mistaken for new files on every run, plus the total window count and guard flags.
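One way to picture the artifact's contents is a versioned payload of plain built-in types. The field names, `ARTIFACT_VERSION`, and both helper functions below are hypothetical; the description only states that the artifact is versioned, stores count-compressed pairs, gated-out files, the total window count, and guard flags.

```python
import pickle
from collections import Counter

ARTIFACT_VERSION = 1  # hypothetical version number


def dump_artifact(pair_counts, gated_files, total_windows, window_size,
                  truncated=False, timed_out=False):
    """Serialize the raw-pair multiset and its bookkeeping to bytes."""
    payload = {
        "version": ARTIFACT_VERSION,
        "window_size": window_size,
        "pair_counts": Counter(pair_counts),   # pair -> multiplicity
        "gated_files": sorted(gated_files),    # considered but gated out
        "total_windows": total_windows,
        "truncated": truncated,                # guard flags
        "timed_out": timed_out,
    }
    return pickle.dumps(payload)


def load_artifact(blob, window_size):
    """Return the payload if it is usable for a splice, else None."""
    try:
        payload = pickle.loads(blob)
    except Exception:
        return None  # corrupt artifact: caller runs the full pipeline
    if not isinstance(payload, dict):
        return None
    if payload.get("version") != ARTIFACT_VERSION:
        return None
    if payload.get("window_size") != window_size:
        return None
    if payload.get("truncated") or payload.get("timed_out"):
        return None
    return payload
```

Storing `pair -> count` rather than repeated rows mirrors the count compression mentioned above.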

Any validity miss falls back to the full pipeline, which rewrites the artifact. The misses are: window-size or limits mismatch, truncated or timed-out persisted state, missing token-cache entries, a splice accounting mismatch, or more than max(16, 20% of files) moved. The health engine passes the changed set through on both the sync and async paths; full runs (init) are unchanged apart from persisting the artifact.
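The change-count guard can be read as a small predicate. The function name and the rounding of the 20% term (integer floor here) are assumptions; the description gives only the formula max(16, 20% of files).

```python
def too_many_changes(changed_count, total_files):
    """True when the changed-file count exceeds max(16, 20% of files)."""
    threshold = max(16, total_files // 5)  # floor of 20% is an assumption
    return changed_count > threshold


# On a small repo the constant floor of 16 dominates.
small_repo = too_many_changes(17, 50)
# At PowerToys scale (5,481 parsed files) the 20% term dominates.
large_repo = too_many_changes(1096, 5481)
```

The constant keeps small repositories on the incremental path for ordinary commits, while the percentage keeps very large changesets from paying splice overhead on most buckets.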

Equivalence

tests/unit/health/test_duplication_incremental.py holds 18 oracle tests asserting the incremental report equals a fresh full recompute, pair-for-pair as a multiset including token_count, plus duplication_pct and pairs_by_file: modify/add/delete/rename, no-op rewrites, intra-file clones, degenerate-cap transitions in both directions, chained splices, gated files staying gated and growing into participants, and every fallback path (too many changes, limits change, missing/corrupt artifact, truncated state). Token-cache retention across splice saves and live co-change weighting are covered too.
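The oracle comparison amounts to multiset equality on pairs plus exact equality on the aggregates. A minimal sketch follows, assuming a report is a dict with `pairs`, `duplication_pct`, and `pairs_by_file` keys; that shape and the helper name are hypothetical and are not taken from the test file.

```python
from collections import Counter


def reports_equal(incremental, full):
    """Compare two reports pair-for-pair as multisets, ignoring pair order."""
    return (
        Counter(incremental["pairs"]) == Counter(full["pairs"])
        and incremental["duplication_pct"] == full["duplication_pct"]
        and incremental["pairs_by_file"] == full["pairs_by_file"]
    )


# Each pair is assumed to be (file_a, file_b, token_count).
full = {
    "pairs": [("a.py", "b.py", 40), ("a.py", "b.py", 40), ("c.py", "d.py", 25)],
    "duplication_pct": 3.5,
    "pairs_by_file": {"a.py": 2, "b.py": 2, "c.py": 1, "d.py": 1},
}
# Same pairs in a different order: still equal as a multiset.
incremental = dict(full, pairs=list(reversed(full["pairs"])))
same = reports_equal(incremental, full)
# Dropping one copy of a repeated pair breaks multiset equality.
lossy = dict(full, pairs=full["pairs"][1:])
different = reports_equal(lossy, full)
```

Comparing as a multiset rather than a set is what catches a lost duplicate, which would otherwise change the accumulated token_count after merging.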

At scale: an A/B probe on PowerToys (5,481 parsed files) appends a real clone to a source file and verifies the spliced report equals a fresh full recompute exactly, for both the modification and a no-op change.

Numbers (1 run per config)

| Metric | Before | After |
| --- | --- | --- |
| `detect_clones` warm, isolated (PowerToys) | 12.65s | 3.40s |
| PowerToys `update` wall | 138.1s | 116.0s |
| hugo `update` wall | 31-39s | 27.7s |
| init wall (PowerToys / hugo) | unchanged | unchanged |
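The relative changes implied by the PowerToys rows can be worked out directly. This is plain arithmetic on the single-run figures reported above, so it inherits their run-to-run noise.

```python
# Figures copied from the table (seconds, one run per config).
detect_before, detect_after = 12.65, 3.40
wall_before, wall_after = 138.1, 116.0

speedup = detect_before / detect_after
detect_saved = detect_before - detect_after
wall_saved = wall_before - wall_after
wall_saved_pct = 100 * wall_saved / wall_before

print(f"detect_clones speedup: {speedup:.2f}x ({detect_saved:.2f}s saved)")
print(f"PowerToys update wall: {wall_saved:.1f}s saved ({wall_saved_pct:.1f}%)")
```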

Full unit suite: 5,015 passed.

A one-file update re-ran the full duplication pipeline (2.3M windows at PowerToys scale) because duplication_pct is repo-wide. Raw clone pairs are a pure function of the gate-surviving window set, and each window pair lives in exactly one hash bucket, so pairs between unchanged files cannot change.

Persist the raw-pair multiset (count-compressed) in a versioned artifact next to the duplication token cache and, on incremental runs, re-verify only buckets touched by changed or deleted files: subtract each touched bucket's old contribution, add its new one, and keep everything else verbatim. The degenerate-bucket cap applies to each touched bucket's full membership, so cap transitions in either direction stay exact. Finalize (merge, min-lines filter, co-change weighting) always runs live against the current git metadata.

Any validity miss falls back to the full pipeline, which rewrites the artifact: window/limits mismatch, truncated or timed-out persisted state, missing token-cache entries, multiset accounting mismatch, or more than max(16, 20% of files) moved.

PowerToys (5,481 parsed files), isolated: warm full detect_clones 12.7s to 3.4s incremental, report equal pair-for-pair (multiset including token_count) and pct-for-pct against a fresh full recompute, for both a real modification and a no-op change.

repowise-bot · 2026-06-12T05:58:09Z

✅ Health: 7.6 (unchanged)
2 files moved · 3 hotspots · 5 hidden couplings · 2 with fix history

🚨 Change risk: high (riskier than 76% of this repo's commits · raw 9.5/10)
This change's risk is driven by:

large diff (many lines added)
scattered, high-entropy change

🩹 Review priority (files here with the most recent bug-fix history; defects cluster, so review these first)

.../health/engine.py — fixed 4× in the last ~6 months
.../duplication/detector.py — fixed 2× in the last ~6 months

| File | Score | Δ | Why |
| --- | --- | --- | --- |
| `.../duplication/token_cache.py` | 9.5 → 9.3 | ▼ -0.2 | ✅ resolved dry violation |
| `.../duplication/detector.py` | 4.8 → 5.2 | ▲ +0.3 | 🔻 introduced large method, complex method, bumpy road · ✅ resolved function hotspot, dry violation |

🔥 Hotspots touched (3)

.../duplication/token_cache.py — 1 commits/90d, 2 dependents · primary owner: Raghav Chamadiya (100%)
.../health/engine.py — 13 commits/90d, 7 dependents · primary owner: Raghav Chamadiya (98%)
.../duplication/detector.py — 5 commits/90d, 3 dependents · primary owner: Raghav Chamadiya (100%)

🔗 Hidden coupling (1 file)

.../health/engine.py co-changes with these files (not in this PR):
- .../biomarkers/base.py (5× — 🟢 routine)
- .../complexity/walker.py (4× — 🟢 routine)
- docs/CODE_HEALTH.md (4× — 🟢 routine)
- .../biomarkers/README.md (4× — 🟢 routine)
- .../biomarkers/registry.py (4× — 🟢 routine)

Last updated 2026-06-12 05:58 UTC

RaghavChamadiya requested a review from swati510 as a code owner June 12, 2026 05:57

swati510 approved these changes Jun 12, 2026

RaghavChamadiya merged commit d8d5790 into main Jun 12, 2026
5 checks passed

RaghavChamadiya deleted the perf/health-tail branch June 12, 2026 07:13
