Good Egg v3: simplify scoring to merge_rate only for unknown contributors #46
Description
Summary
Simplify the GE scoring formula for unknown contributors (the population scored when skip_known_contributors=true). Drop hub_score and log_account_age from the v2 LR; use alltime merge_rate as the sole scoring input.
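As a rough illustration of the proposed behavior (not the actual `scorer.py` code), the simplified scoring could look something like the sketch below. The `ContributorStats` container, its field names, and the `score_unknown` function are hypothetical, and defining merge_rate as merged PRs over all resolved PRs is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContributorStats:
    """Hypothetical container; field names are illustrative, not from scorer.py."""
    merged_prs: int
    closed_unmerged_prs: int
    is_known: bool = False


def merge_rate(stats: ContributorStats) -> Optional[float]:
    """All-time merge rate, assumed here to be merged / (merged + closed unmerged)."""
    total = stats.merged_prs + stats.closed_unmerged_prs
    if total == 0:
        return None  # no PR history, so no rate can be computed
    return stats.merged_prs / total


def score_unknown(stats: ContributorStats,
                  skip_known_contributors: bool = True) -> Optional[float]:
    """Return the v3-style score: merge_rate alone, with no hub_score or account age."""
    if skip_known_contributors and stats.is_known:
        return None  # known contributors are fast-tracked and not scored
    return merge_rate(stats)
```

Under these assumptions, a contributor with 3 merged and 1 closed-unmerged PR scores 0.75, and a known contributor is skipped entirely when `skip_known_contributors` is true.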
Evidence
Experiments on the bot-detection branch (PR #44), tracked in experiments/bot_detection/RESULTS.md:
hub_score hurts unknown contributors (stage17): merge_rate alone outperforms every model that includes hub_score across all repo size tiers:
| Tier | mr_only | mr+hub | Delta (mr+hub - mr_only) |
|---|---|---|---|
| All medium+ | 0.516 | 0.408 | -0.108 |
| Large (500-1999 PRs) | 0.553 | 0.484 | -0.069 |
| XL (2000+ PRs) | 0.533 | 0.405 | -0.128 |
log_account_age adds nothing (stage19): On 4 stable cutoffs (n=130 to n=1014, 5-fold CV), mr+age never beats mr_only. DeLong p > 0.07 at every cutoff. age_only AUC is 0.505-0.522 (barely above chance).
| Cutoff | N | mr_only | mr+age | DeLong p |
|---|---|---|---|---|
| T_2022 | 130 | 0.584 | 0.576 | 0.807 |
| T_2022-07 | 431 | 0.606 | 0.606 | 0.992 |
| T_2023 | 474 | 0.552 | 0.534 | 0.076 |
| T_2024 | 1014 | 0.580 | 0.569 | 0.111 |
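For readers interpreting the AUC columns, the sketch below shows one standard way to compute ROC AUC: the fraction of (positive, negative) pairs that the score ranks correctly, with ties counted as half. It is only an illustration of the metric, not the experiment code from the bot-detection branch, and the `roc_auc` helper name is hypothetical. It does not cover cross-validation or the DeLong test.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))
```

A score that carries no ranking information yields 0.5 under this definition, which is why an age_only AUC of 0.505-0.522 reads as barely above chance.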
Recency windows don't help (stage18): No significant difference between alltime, 2yr, 1yr, 6mo, or 3mo merge_rate for unknown contributors (zero significant DeLong tests across all tiers and cutoffs).
Cross-repo merge prediction confirms (stage13): hub_score hurts here too. ge_v2_proxy (hub_score + merge_rate) AUC 0.542 vs merge_rate_only AUC 0.576.
PR #27 validation study corroborates: account_age was LRT-significant (p = 1.2e-5) against graph_score but did not improve AUC ranking (DeLong p = 0.65 for GE + merge_rate + age vs GE alone).
Implementation tasks
- Remove hub_score (graph_score) from the scoring formula in scorer.py
- Remove log_account_age from the scoring formula in scorer.py
- Replace the v2 3-feature LR with alltime merge_rate as sole input for unknown contributors
- Keep graph construction (needed for repo discovery and contributor mapping)
- Keep skip_known_contributors logic (fast-tracks known contributors)
- Update thresholds to map merge_rate directly to trust levels
- Update docs and config reference
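The threshold task could be sketched roughly as follows. The cutoff values and trust-level names here are placeholders invented for illustration; the real thresholds and labels would have to come from the project's config and calibration, which this issue does not specify.

```python
from bisect import bisect_right

# Placeholder cutoffs and labels: NOT values from the project.
THRESHOLDS = (0.3, 0.6)
LEVELS = ("LOW", "MEDIUM", "HIGH")


def trust_level(merge_rate: float) -> str:
    """Map a merge_rate in [0, 1] directly to a trust level via sorted cutoffs."""
    if not 0.0 <= merge_rate <= 1.0:
        raise ValueError("merge_rate must be between 0 and 1")
    # bisect_right counts how many cutoffs are <= merge_rate,
    # so a value equal to a cutoff lands in the higher level.
    return LEVELS[bisect_right(THRESHOLDS, merge_rate)]
```

Keeping the cutoffs in a single sorted tuple means the mapping stays a one-line lookup, and changing the number of trust levels only requires editing the two constants together.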