Conversation
|
just merged big changes into test please fix conflicts thanks |
d9ff4cd to
d8a9df5
Compare
|
@anderdc I've rebased it on current |
d8a9df5 to
4ade5b6
Compare
|
I have a PR going into test now that will fix that pyright type check CI, don't worry about it |
There was a problem hiding this comment.
Review
Good problem identification — test and non-code files dragging down source scores via density is a real issue and the per-category approach is the right direction. The ScoringCategory enum and the category property on FileScoreResult are clean additions. The test suite is thorough and covers the right invariants.
However, the implementation adds more scaffolding than necessary and introduces a redundant iteration. Requesting changes on the following:
1. Don't loop twice over file results
from_file_results() re-iterates all file_results to categorize and re-sum totals that the existing loop in tree_sitter_scoring.py already computes. The category is trivially known at each append site (is_test_file + scoring_method), so just accumulate into per-category dicts inline during the existing loop. No second pass needed.
2. Extend PrScoringResult instead of adding PrScoringResultCategorized
The new wrapper class duplicates fields from PrScoringResult (total_score, total_nodes_scored, score_breakdown) and adds indirection. Since PrScoringResult is only constructed in one place and consumed in one place, just add by_category: Dict[ScoringCategory, PrScoringResult] and total_lines: int to the existing class. No need for a parallel class hierarchy or the _EMPTY_SCORING_RESULT sentinel.
3. Keep scoring logic in scoring.py
calculate_initial_base_score() and calculate_contribution_bonus() are business logic that reference scoring constants (MERGED_PR_BASE_SCORE, MAX_CONTRIBUTION_BONUS, CONTRIBUTION_SCORE_FOR_FULL_BONUS). This logic currently lives in scoring.py alongside the rest of the scoring decisions. Moving it into a dataclass in classes.py scatters scoring logic across two files. Keep it in scoring.py — the caller can iterate by_category directly with a few lines.
4. Threshold check may not isolate categories correctly
In calculate_initial_base_score(), the threshold uses self.score_breakdown.total_score — the aggregate across all categories including test and non-code. A PR with trivial source but a large test suite could pass the threshold via test score alone. The existing tests for this (test_tests_do_not_affect_threshold) pass token_score=0.0 explicitly on the factory, so they don't catch this path. The intent is that only SOURCE contributions matter for the threshold, the check should use the source category's score, not the aggregate.
5. Preserve diagnostic detail in log output
The current log shows density and bonus percentage:
Base score: 15.00 (density 0.50) + 3.0 bonus (10% of max 30) = 18.00
The PR reduces this to:
Base score: 15.00 + 3.0 bonus = 18.00
With per-category scoring, the density breakdown is actually more useful to log, not less. Please preserve or improve the diagnostic output.
8adcdf6 to
9c9860f
Compare
639d9d5 to
60b3b9c
Compare
60b3b9c to
8868d19
Compare
|
Rebased + addressed all points |
|
Each category gets its own density capped at How about we just compute density on SOURCE only, and let tests/non-code contribute toward the contribution bonus instead? That's already what was going to be introduced with this PR anyways, thoughts? |
We can, but we have to keep in mind that:
However, with current implementation maybe it's a little bit unfair that you can achieve the same test score for both 200 LOC and 2k LOC. On the other hand, that's how it currently works for production code scoring: because of density you can achieve nearly the same score for both small and big changes (the only diff is contribution bonus, but it adds just a little). A PR which consists of a well-written 20-line function can easily beat a much larger PR with much more token score but worse density. Maybe it's unfair too, but it's how it works now (the good part that it can be tweaked to fairer values only by changing constants, no algorithm changes are required, but it's a different story I guess). So, what I can suggest here:
|
|
My intention with the current state of scoring is to reward targeted PRs, in general yes that's generally smaller things, and the smaller/more concise the better. Perhaps yes, it should be dialed back a tad to target medium sized PRs (my favorite sweet spot), but that's a conversation for another thread. In the meantime
Implement that and we should be good to merge pending final checks |
Closes #339
This PR addresses the issue that test code affect scoring process.
Before
After
Implementation details
ScoringCategoryenum. Possible values areTEST(if it's a test file),SOURCE(if scoring method istree_sitterand it's not a test file),NON_CODE(everything else: recognized non-code changes, unrecognized non-code changes, deleted files, binary files)PrScoringResultCategorized, which holdsPrScoringResultper each existing category.scoring.pyintoPrScoringResultCategorizedmethods.Test cases
Test files:
test_adding_tests_does_not_reduce_score- Adding test files to a source PR never lowers the base scoretest_tests_do_not_affect_contribution_bonus- Small and large test files produce modest, similar increases (test weight is 0.05x)test_same_code_in_test_path_scores_much_lower- Identical code in a test directory scores 10x+ lower than in a source pathtest_tests_do_not_affect_threshold- Test files can't push a below-threshold PR past the token score thresholdNon-code files:
test_adding_non_code_files_does_not_reduce_score- Adding non-code files (markdown, yaml) never lowers the base scoretest_non_code_does_not_affect_contribution_bonus- Small and large non-code files produce the same density increase (no bonus impact)test_source_code_scores_much_higher_than_non_code- Tree-diff scored source code scores 10x+ higher than line-count scored non-code filestest_non_code_does_not_affect_threshold- Non-code files can't push a below-threshold PR past the token score thresholdZero-score files:
test_deleted_file_does_not_change_score- Deleted files (score=0) don't affect the base scoretest_unsupported_file_does_not_change_score- Unsupported extensions (score=0) don't affect the base scoreDensity:
test_adding_test_category_increases_score_beyond_single_cap- Per-category density cap allows multiple categories to contribute independentlytest_verbose_formatting_decreases_score- Same logic in more lines produces lower density and lower scoretest_modified_file_scores_diff_only- Modified files score only the AST diff, not the entire fileThreshold:
test_below_threshold_scores_less- Trivial changes below token score threshold score less than substantial changes