perf(health): incremental duplication pair splice for update runs #460

Merged: RaghavChamadiya merged 1 commit into main from perf/health-tail on Jun 12, 2026

@RaghavChamadiya (Member)

Problem

A one-file repowise update re-runs the full duplication pipeline because duplication_pct is repo-wide. On a PowerToys-scale repo that means rebuilding 2.3M rolling-hash windows and re-verifying 1.25M candidate pairs, only to discover that almost nothing changed. The token cache (added earlier) removed the re-tokenize cost, but the windowing, bucketing, and verification stages still ran in full on every update, taking ~11-13s warm.

Approach

Raw clone pairs are a pure function of the gate-surviving window set, and each window pair lives in exactly one hash bucket, so pairs between unchanged files cannot change. This PR persists the raw-pair multiset in a versioned artifact (duplication_pairs.pkl) next to the duplication token cache, and detect_clones gains a changed_files parameter. On incremental runs it splices instead of recomputing:

  • Re-verify only buckets touched by a changed or deleted file (old windows come from the token cache by old content hash; new windows are computed live for the changed files only).
  • For every touched bucket, subtract its old contribution and add its new one. Both are recomputed deterministically against the bucket's full membership, so the degenerate-bucket cap transitions correctly in either direction (a bucket shrinking below the cap revives its pairs; one growing past it drops them).
  • The multiset is spliced with Counter arithmetic. Multiplicity matters because the merge stage accumulates token_count; identical raw pairs always merge together, which also lets the artifact store counts instead of repeated rows (1.25M pairs compress to 367k rows, 7.7MB on PowerToys).
  • Finalize (merge, min-lines filter, co-change weighting) always runs live against the current git metadata, so co-change weights never go stale.
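The subtract/add step can be sketched with Counter arithmetic. This is a minimal illustration, not the detector's code: `BUCKET_CAP`, `bucket_pairs`, and `splice` are hypothetical names, the cap value is made up, and token-level verification of each candidate pair is left out.

```python
from collections import Counter
from itertools import combinations

# Hypothetical cap; the real degenerate-bucket limit is not stated in this PR.
BUCKET_CAP = 4


def bucket_pairs(members):
    """Raw pairs contributed by one hash bucket (nothing if it is degenerate).

    `members` is a list of (file, window_offset) tuples sharing one hash.
    Real verification of each candidate pair is omitted in this sketch.
    """
    if len(members) > BUCKET_CAP:
        return Counter()
    pairs = Counter()
    for a, b in combinations(sorted(members), 2):
        pairs[(a[0], b[0])] += 1  # multiplicity is kept, not deduplicated
    return pairs


def splice(persisted, old_buckets, new_buckets, touched):
    """Subtract each touched bucket's old contribution, then add its new one."""
    result = Counter(persisted)
    for h in touched:
        old = bucket_pairs(old_buckets.get(h, []))
        # A pair we expect to remove but cannot find means the persisted
        # state is inconsistent; the caller should fall back to a full run.
        if any(result[p] < n for p, n in old.items()):
            raise ValueError("splice accounting mismatch")
        result.subtract(old)
        result.update(bucket_pairs(new_buckets.get(h, [])))
    return +result  # unary plus drops zero-count entries
```

Because both contributions are recomputed from the bucket's full membership, a bucket that crosses the cap in either direction needs no special case: its old or new contribution is simply empty.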

The artifact also records which files the detector considered but gated out (minified, too small, over the token cap), so they are not mistaken for new files on every run, plus the total window count and guard flags.
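One plausible shape for such an artifact is sketched below. The field names, `ARTIFACT_VERSION`, and the `save_artifact`/`load_artifact` helpers are assumptions for illustration; only the file name duplication_pairs.pkl and the kinds of data stored come from the description above.

```python
import pickle
from collections import Counter
from pathlib import Path

ARTIFACT_VERSION = 1  # hypothetical; bumped whenever the layout changes


def save_artifact(path: Path, window_size: int, pair_counts: Counter,
                  gated_files: dict, total_windows: int,
                  truncated: bool = False) -> None:
    """Write the payload via a temp file so readers never see a partial file."""
    payload = {
        "version": ARTIFACT_VERSION,
        "window_size": window_size,
        "pair_counts": pair_counts,    # raw pair -> multiplicity
        "gated_files": gated_files,    # path -> reason it was gated out
        "total_windows": total_windows,
        "truncated": truncated,        # guard flag
    }
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))
    tmp.replace(path)


def load_artifact(path: Path, window_size: int):
    """Return the payload, or None on any validity miss (caller runs in full)."""
    try:
        payload = pickle.loads(path.read_bytes())
    except Exception:
        return None  # missing or corrupt file
    if not isinstance(payload, dict):
        return None
    if payload.get("version") != ARTIFACT_VERSION:
        return None
    if payload.get("window_size") != window_size:
        return None
    if payload.get("truncated"):
        return None
    return payload
```

Returning None rather than raising keeps the caller's logic simple: any miss takes the same full-pipeline path, which rewrites the artifact.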

Any validity miss falls back to the full pipeline, which rewrites the artifact: window-size or limits mismatch, truncated or timed-out persisted state, missing token-cache entries, a splice accounting mismatch, or more than max(16, 20% of files) moved. The health engine passes the changed set through on both the sync and async paths; full runs (init) are unchanged apart from persisting the artifact.
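As a worked example of the change-volume threshold, here is a small sketch. The function name is hypothetical and the rounding (integer floor of 20%) is an assumption; the PR only states the max(16, 20% of files) rule.

```python
def too_many_changes(changed_count: int, total_files: int) -> bool:
    """True when the changed set exceeds max(16, 20% of files).

    Assumes 20% is floored to an integer; the real rounding is not specified.
    """
    limit = max(16, total_files * 20 // 100)
    return changed_count > limit
```

Under that assumption, a repo with 5,481 parsed files tolerates up to 1,096 changed files before falling back, while a 10-file repo still tolerates 16 thanks to the floor of 16.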

Equivalence

tests/unit/health/test_duplication_incremental.py holds 18 oracle tests asserting the incremental report equals a fresh full recompute, pair-for-pair as a multiset including token_count, plus duplication_pct and pairs_by_file: modify/add/delete/rename, no-op rewrites, intra-file clones, degenerate-cap transitions in both directions, chained splices, gated files staying gated and growing into participants, and every fallback path (too many changes, limits change, missing/corrupt artifact, truncated state). Token-cache retention across splice saves and live co-change weighting are covered too.
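The oracle comparison could look roughly like the helper below. The report layout (a dict with `pairs`, `duplication_pct`, and `pairs_by_file`, and pair dicts keyed by `file_a`, `file_b`, `token_count`) is a hypothetical stand-in; the actual report type in the test file may differ.

```python
from collections import Counter


def pair_multiset(pairs):
    """Order-insensitive view of clone pairs that keeps duplicates."""
    return Counter(
        (p["file_a"], p["file_b"], p["token_count"]) for p in pairs
    )


def assert_equivalent(incremental, full):
    """Incremental output must match a fresh full recompute exactly."""
    assert pair_multiset(incremental["pairs"]) == pair_multiset(full["pairs"])
    assert incremental["duplication_pct"] == full["duplication_pct"]
    assert incremental["pairs_by_file"] == full["pairs_by_file"]
```

Comparing as a multiset rather than a set matters here: two reports with the same distinct pairs but different multiplicities would merge to different token_count totals.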

At scale: an A/B probe on PowerToys (5,481 parsed files) appends a real clone to a source file and verifies the spliced report equals a fresh full recompute exactly, for both the modification and a no-op change.

Numbers (1 run per config)

| Metric | Before | After |
| --- | --- | --- |
| detect_clones warm, isolated (PowerToys) | 12.65s | 3.40s |
| PowerToys update wall | 138.1s | 116.0s |
| hugo update wall | 31-39s | 27.7s |
| init wall (PowerToys / hugo) | unchanged | unchanged |

Full unit suite: 5,015 passed.

A one-file update re-ran the full duplication pipeline (2.3M windows at
PowerToys scale) because duplication_pct is repo-wide. Raw clone pairs
are a pure function of the gate-surviving window set, and each window
pair lives in exactly one hash bucket, so pairs between unchanged files
cannot change. Persist the raw-pair multiset (count-compressed) in a
versioned artifact next to the duplication token cache and, on
incremental runs, re-verify only buckets touched by changed or deleted
files: subtract each touched bucket's old contribution, add its new one,
and keep everything else verbatim. The degenerate-bucket cap applies to
each touched bucket's full membership, so cap transitions in either
direction stay exact. Finalize (merge, min-lines filter, co-change
weighting) always runs live against the current git metadata.

Any validity miss falls back to the full pipeline, which rewrites the
artifact: window/limits mismatch, truncated or timed-out persisted
state, missing token-cache entries, multiset accounting mismatch, or
more than max(16, 20% of files) moved.

PowerToys (5,481 parsed files), isolated: warm full detect_clones 12.7s
to 3.4s incremental, report equal pair-for-pair (multiset including
token_count) and pct-for-pct against a fresh full recompute, for both a
real modification and a no-op change.
@repowise-bot

repowise-bot Bot commented Jun 12, 2026


✅ Health: 7.6 (unchanged)
2 files moved · 3 hotspots · 5 hidden couplings · 2 with fix history

🚨 Change risk: high (riskier than 76% of this repo's commits · raw 9.5/10)
This change's risk is driven by:

  • large diff (many lines added)
  • scattered, high-entropy change

🩹 Review priority (files here with the most recent bug-fix history — defects cluster, so review these first)

| File | Score | Δ | Why |
| --- | --- | --- | --- |
| .../duplication/token_cache.py | 9.5 → 9.3 | ▼ -0.2 | ✅ resolved dry violation |
| .../duplication/detector.py | 4.8 → 5.2 | ▲ +0.3 | 🔻 introduced large method, complex method, bumpy road · ✅ resolved function hotspot, dry violation |

🔥 Hotspots touched (3)
  • .../duplication/token_cache.py — 1 commits/90d, 2 dependents · primary owner: Raghav Chamadiya (100%)
  • .../health/engine.py — 13 commits/90d, 7 dependents · primary owner: Raghav Chamadiya (98%)
  • .../duplication/detector.py — 5 commits/90d, 3 dependents · primary owner: Raghav Chamadiya (100%)
🔗 Hidden coupling (1 file)
  • .../health/engine.py co-changes with these files (not in this PR):
    • .../biomarkers/base.py (5× — 🟢 routine)
    • .../complexity/walker.py (4× — 🟢 routine)
    • docs/CODE_HEALTH.md (4× — 🟢 routine)
    • .../biomarkers/README.md (4× — 🟢 routine)
    • .../biomarkers/registry.py (4× — 🟢 routine)

Last updated 2026-06-12 05:58 UTC

@RaghavChamadiya RaghavChamadiya merged commit d8d5790 into main Jun 12, 2026
5 checks passed
@RaghavChamadiya RaghavChamadiya deleted the perf/health-tail branch June 12, 2026 07:13