Skip to content

Scoring tests separately#350

Open
seroperson wants to merge 1 commit intoentrius:testfrom
seroperson:i339-tests-separate-scoring
Open

Scoring tests separately#350
seroperson wants to merge 1 commit intoentrius:testfrom
seroperson:i339-tests-separate-scoring

Conversation

@seroperson
Copy link
Copy Markdown

Closes #339

This PR addresses the issue that test code affect scoring process.

Before

  • Tests + source + non-code changes are counted for contribution bonus and also counted in density calculation.
  • Adding tests significantly reduce your score, because they give almost no score, but still affect density.
  • Adding non-code changes also affect the source density and may lower the score. The situation is better than with tests, as recognizable file (via extension) at least contribute some score. But unrecognized files still always lower the score, as they give zero score and affect density.
  • Situation is the same with deleted and binary files: they contribute no score, but still affect the density.

After

  • Test + source + non-code changes has its' own density and calculate its' own score.
  • Only source changes are counted for contribution bonus.
  • After each category is scored calculated, they are summed.

Implementation details

  • Added ScoringCategory enum. Possible values are TEST (if it's a test file), SOURCE (if scoring method is tree_sitter and it's not a test file), NON_CODE (everything else: recognized non-code changes, unrecognized non-code changes, deleted files, binary files)
  • Added PrScoringResultCategorized, which holds PrScoringResult per each existing category.
  • Scoring logic moved from scoring.py into PrScoringResultCategorized methods.

Test cases

Test files:

  • test_adding_tests_does_not_reduce_score - Adding test files to a source PR never lowers the base score
  • test_tests_do_not_affect_contribution_bonus - Small and large test files produce modest, similar increases (test weight is 0.05x)
  • test_same_code_in_test_path_scores_much_lower - Identical code in a test directory scores 10x+ lower than in a source path
  • test_tests_do_not_affect_threshold - Test files can't push a below-threshold PR past the token score threshold

Non-code files:

  • test_adding_non_code_files_does_not_reduce_score - Adding non-code files (markdown, yaml) never lowers the base score
  • test_non_code_does_not_affect_contribution_bonus - Small and large non-code files produce the same density increase (no bonus impact)
  • test_source_code_scores_much_higher_than_non_code - Tree-diff scored source code scores 10x+ higher than line-count scored non-code files
  • test_non_code_does_not_affect_threshold - Non-code files can't push a below-threshold PR past the token score threshold

Zero-score files:

  • test_deleted_file_does_not_change_score - Deleted files (score=0) don't affect the base score
  • test_unsupported_file_does_not_change_score - Unsupported extensions (score=0) don't affect the base score

Density:

  • test_adding_test_category_increases_score_beyond_single_cap - Per-category density cap allows multiple categories to contribute independently
  • test_verbose_formatting_decreases_score - Same logic in more lines produces lower density and lower score
  • test_modified_file_scores_diff_only - Modified files score only the AST diff, not the entire file

Threshold:

  • test_below_threshold_scores_less - Trivial changes below token score threshold score less than substantial changes

@anderdc
Copy link
Copy Markdown
Collaborator

anderdc commented Apr 7, 2026

just merged big changes into test please fix conflicts thanks

@seroperson seroperson force-pushed the i339-tests-separate-scoring branch 2 times, most recently from d9ff4cd to d8a9df5 Compare April 7, 2026 08:50
@seroperson
Copy link
Copy Markdown
Author

@anderdc I've rebased it on current test 👍

@seroperson seroperson force-pushed the i339-tests-separate-scoring branch from d8a9df5 to 4ade5b6 Compare April 9, 2026 23:23
@anderdc
Copy link
Copy Markdown
Collaborator

anderdc commented Apr 10, 2026

I have a PR going into test now that will fix that pyright type check CI, don't worry about it

Copy link
Copy Markdown
Collaborator

@anderdc anderdc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Good problem identification — test and non-code files dragging down source scores via density is a real issue and the per-category approach is the right direction. The ScoringCategory enum and the category property on FileScoreResult are clean additions. The test suite is thorough and covers the right invariants.

However, the implementation adds more scaffolding than necessary and introduces a redundant iteration. Requesting changes on the following:

1. Don't loop twice over file results

from_file_results() re-iterates all file_results to categorize and re-sum totals that the existing loop in tree_sitter_scoring.py already computes. The category is trivially known at each append site (is_test_file + scoring_method), so just accumulate into per-category dicts inline during the existing loop. No second pass needed.

2. Extend PrScoringResult instead of adding PrScoringResultCategorized

The new wrapper class duplicates fields from PrScoringResult (total_score, total_nodes_scored, score_breakdown) and adds indirection. Since PrScoringResult is only constructed in one place and consumed in one place, just add by_category: Dict[ScoringCategory, PrScoringResult] and total_lines: int to the existing class. No need for a parallel class hierarchy or the _EMPTY_SCORING_RESULT sentinel.

3. Keep scoring logic in scoring.py

calculate_initial_base_score() and calculate_contribution_bonus() are business logic that reference scoring constants (MERGED_PR_BASE_SCORE, MAX_CONTRIBUTION_BONUS, CONTRIBUTION_SCORE_FOR_FULL_BONUS). This logic currently lives in scoring.py alongside the rest of the scoring decisions. Moving it into a dataclass in classes.py scatters scoring logic across two files. Keep it in scoring.py — the caller can iterate by_category directly with a few lines.

4. Threshold check may not isolate categories correctly

In calculate_initial_base_score(), the threshold uses self.score_breakdown.total_score — the aggregate across all categories including test and non-code. A PR with trivial source but a large test suite could pass the threshold via test score alone. The existing tests for this (test_tests_do_not_affect_threshold) pass token_score=0.0 explicitly on the factory, so they don't catch this path. The intent is that only SOURCE contributions matter for the threshold, the check should use the source category's score, not the aggregate.

5. Preserve diagnostic detail in log output

The current log shows density and bonus percentage:

Base score: 15.00 (density 0.50) + 3.0 bonus (10% of max 30) = 18.00

The PR reduces this to:

Base score: 15.00 + 3.0 bonus = 18.00

With per-category scoring, the density breakdown is actually more useful to log, not less. Please preserve or improve the diagnostic output.

@anderdc anderdc force-pushed the test branch 2 times, most recently from 8adcdf6 to 9c9860f Compare April 10, 2026 17:07
@seroperson seroperson force-pushed the i339-tests-separate-scoring branch from 639d9d5 to 60b3b9c Compare April 11, 2026 01:33
@seroperson seroperson force-pushed the i339-tests-separate-scoring branch from 60b3b9c to 8868d19 Compare April 11, 2026 03:10
@seroperson
Copy link
Copy Markdown
Author

Rebased + addressed all points

@seroperson seroperson requested a review from anderdc April 11, 2026 03:14
@anderdc
Copy link
Copy Markdown
Collaborator

anderdc commented Apr 12, 2026

Each category gets its own density capped at MAX_CODE_DENSITY_MULTIPLIER, so the theoretical ceiling is 3x what it was before. In practice test/non-code scores are tiny, but the cap is implicit rather than by design.

How about we just compute density on SOURCE only, and let tests/non-code contribute toward the contribution bonus instead? That's already what was going to be introduced with this PR anyways, thoughts?

@seroperson
Copy link
Copy Markdown
Author

seroperson commented Apr 12, 2026

How about we just compute density on SOURCE only, and let tests/non-code contribute toward the contribution bonus instead? That's already what was going to be introduced with this PR anyways, thoughts?

We can, but we have to keep in mind that:

  • With current PR implementation writing more tests doesn't mean that you'll get more score (because of density). You can write 2k LOC of tests and it'll give you the same score as 200 LOC. Likely it'll never reach the cap because lines are growing much faster than x0.05 test's score.
  • With what you propose test impact is scaling linearly: the more tests you write, the more score you'll get (until capped). Theoretically you can write 10k LOC of tests and get your capped contribution_bonus for that, which is impossible with density approach.

However, with current implementation maybe it's a little bit unfair that you can achieve the same test score for both 200 LOC and 2k LOC. On the other hand, that's how it currently works for production code scoring: because of density you can achieve nearly the same score for both small and big changes (the only diff is contribution bonus, but it adds just a little). A PR which consists of a well-written 20-line function can easily beat a much larger PR with much more token score but worse density. Maybe it's unfair too, but it's how it works now (the good part that it can be tweaked to fairer values only by changing constants, no algorithm changes are required, but it's a different story I guess).

So, what I can suggest here:

  • test and non-code changes don't have a density and contribute only to contribution_bonus
  • contribution_bonus will look like min(source_contribution_bonus + test_and_non_code_contribution_bonus, MAX_CONTRIBUTION_BONUS)
  • test_and_non_code_contribution_bonus is capped at 5% (?) of initial_base_score + source_contribution_bonus, where initial_base_score is MERGED_PR_BASE_SCORE * source.density.
  • This way you will still get more score for more tests, but it won't grow until MAX_CONTRIBUTION_BONUS cap, like source changes do.

@anderdc
Copy link
Copy Markdown
Collaborator

anderdc commented Apr 13, 2026

My intention with the current state of scoring is to reward targeted PRs, in general yes that's generally smaller things, and the smaller/more concise the better. Perhaps yes, it should be dialed back a tad to target medium sized PRs (my favorite sweet spot), but that's a conversation for another thread.

In the meantime

  • yes, change so no density required on non_code and test contributions, just do source code density capped at the constants.py value
  • let's NOT cap the max contribution bonus for tests, not saying I like or dislike it, need to chew on that type of change + if we did it it should be it's own PR. Although what you said is correct, me personally as a maintainer, I will not allow disproportionate testing PRs, this one is even pushing it a bit. I want to get this PR in already.

Implement that and we should be good to merge pending final checks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tests affect code density and reduce resulting score very much

2 participants