Scoring tests separately by seroperson · Pull Request #350 · entrius/gittensor

seroperson · 2026-04-07T00:23:23Z

Closes #339

This PR addresses the issue that test code affect scoring process.

Before

Tests + source + non-code changes are counted for contribution bonus and also counted in density calculation.
Adding tests significantly reduce your score, because they give almost no score, but still affect density.
Adding non-code changes also affect the source density and may lower the score. The situation is better than with tests, as recognizable file (via extension) at least contribute some score. But unrecognized files still always lower the score, as they give zero score and affect density.
Situation is the same with deleted and binary files: they contribute no score, but still affect the density.

After

Test + source + non-code changes has its' own density and calculate its' own score.
Only source changes are counted for contribution bonus.
After each category is scored calculated, they are summed.

Implementation details

Added ScoringCategory enum. Possible values are TEST (if it's a test file), SOURCE (if scoring method is tree_sitter and it's not a test file), NON_CODE (everything else: recognized non-code changes, unrecognized non-code changes, deleted files, binary files)
Added PrScoringResultCategorized, which holds PrScoringResult per each existing category.
Scoring logic moved from scoring.py into PrScoringResultCategorized methods.

Test cases

Test files:

test_adding_tests_does_not_reduce_score - Adding test files to a source PR never lowers the base score
test_tests_do_not_affect_contribution_bonus - Small and large test files produce modest, similar increases (test weight is 0.05x)
test_same_code_in_test_path_scores_much_lower - Identical code in a test directory scores 10x+ lower than in a source path
test_tests_do_not_affect_threshold - Test files can't push a below-threshold PR past the token score threshold

Non-code files:

test_adding_non_code_files_does_not_reduce_score - Adding non-code files (markdown, yaml) never lowers the base score
test_non_code_does_not_affect_contribution_bonus - Small and large non-code files produce the same density increase (no bonus impact)
test_source_code_scores_much_higher_than_non_code - Tree-diff scored source code scores 10x+ higher than line-count scored non-code files
test_non_code_does_not_affect_threshold - Non-code files can't push a below-threshold PR past the token score threshold

Zero-score files:

test_deleted_file_does_not_change_score - Deleted files (score=0) don't affect the base score
test_unsupported_file_does_not_change_score - Unsupported extensions (score=0) don't affect the base score

Density:

test_adding_test_category_increases_score_beyond_single_cap - Per-category density cap allows multiple categories to contribute independently
test_verbose_formatting_decreases_score - Same logic in more lines produces lower density and lower score
test_modified_file_scores_diff_only - Modified files score only the AST diff, not the entire file

Threshold:

test_below_threshold_scores_less - Trivial changes below token score threshold score less than substantial changes

anderdc · 2026-04-07T01:29:29Z

just merged big changes into test please fix conflicts thanks

seroperson · 2026-04-09T07:42:37Z

@anderdc I've rebased it on current test 👍

anderdc · 2026-04-10T15:52:44Z

I have a PR going into test now that will fix that pyright type check CI, don't worry about it

anderdc

Review

Good problem identification — test and non-code files dragging down source scores via density is a real issue and the per-category approach is the right direction. The ScoringCategory enum and the category property on FileScoreResult are clean additions. The test suite is thorough and covers the right invariants.

However, the implementation adds more scaffolding than necessary and introduces a redundant iteration. Requesting changes on the following:

1. Don't loop twice over file results

from_file_results() re-iterates all file_results to categorize and re-sum totals that the existing loop in tree_sitter_scoring.py already computes. The category is trivially known at each append site (is_test_file + scoring_method), so just accumulate into per-category dicts inline during the existing loop. No second pass needed.

2. Extend `PrScoringResult` instead of adding `PrScoringResultCategorized`

The new wrapper class duplicates fields from PrScoringResult (total_score, total_nodes_scored, score_breakdown) and adds indirection. Since PrScoringResult is only constructed in one place and consumed in one place, just add by_category: Dict[ScoringCategory, PrScoringResult] and total_lines: int to the existing class. No need for a parallel class hierarchy or the _EMPTY_SCORING_RESULT sentinel.

3. Keep scoring logic in `scoring.py`

calculate_initial_base_score() and calculate_contribution_bonus() are business logic that reference scoring constants (MERGED_PR_BASE_SCORE, MAX_CONTRIBUTION_BONUS, CONTRIBUTION_SCORE_FOR_FULL_BONUS). This logic currently lives in scoring.py alongside the rest of the scoring decisions. Moving it into a dataclass in classes.py scatters scoring logic across two files. Keep it in scoring.py — the caller can iterate by_category directly with a few lines.

4. Threshold check may not isolate categories correctly

In calculate_initial_base_score(), the threshold uses self.score_breakdown.total_score — the aggregate across all categories including test and non-code. A PR with trivial source but a large test suite could pass the threshold via test score alone. The existing tests for this (test_tests_do_not_affect_threshold) pass token_score=0.0 explicitly on the factory, so they don't catch this path. The intent is that only SOURCE contributions matter for the threshold, the check should use the source category's score, not the aggregate.

5. Preserve diagnostic detail in log output

The current log shows density and bonus percentage:

Base score: 15.00 (density 0.50) + 3.0 bonus (10% of max 30) = 18.00

The PR reduces this to:

Base score: 15.00 + 3.0 bonus = 18.00

With per-category scoring, the density breakdown is actually more useful to log, not less. Please preserve or improve the diagnostic output.

seroperson · 2026-04-11T03:14:04Z

Rebased + addressed all points

anderdc · 2026-04-12T20:52:49Z

Each category gets its own density capped at MAX_CODE_DENSITY_MULTIPLIER, so the theoretical ceiling is 3x what it was before. In practice test/non-code scores are tiny, but the cap is implicit rather than by design.

How about we just compute density on SOURCE only, and let tests/non-code contribute toward the contribution bonus instead? That's already what was going to be introduced with this PR anyways, thoughts?

seroperson · 2026-04-12T23:33:19Z

How about we just compute density on SOURCE only, and let tests/non-code contribute toward the contribution bonus instead? That's already what was going to be introduced with this PR anyways, thoughts?

We can, but we have to keep in mind that:

With current PR implementation writing more tests doesn't mean that you'll get more score (because of density). You can write 2k LOC of tests and it'll give you the same score as 200 LOC. Likely it'll never reach the cap because lines are growing much faster than x0.05 test's score.
With what you propose test impact is scaling linearly: the more tests you write, the more score you'll get (until capped). Theoretically you can write 10k LOC of tests and get your capped contribution_bonus for that, which is impossible with density approach.

However, with current implementation maybe it's a little bit unfair that you can achieve the same test score for both 200 LOC and 2k LOC. On the other hand, that's how it currently works for production code scoring: because of density you can achieve nearly the same score for both small and big changes (the only diff is contribution bonus, but it adds just a little). A PR which consists of a well-written 20-line function can easily beat a much larger PR with much more token score but worse density. Maybe it's unfair too, but it's how it works now (the good part that it can be tweaked to fairer values only by changing constants, no algorithm changes are required, but it's a different story I guess).

So, what I can suggest here:

test and non-code changes don't have a density and contribute only to contribution_bonus
contribution_bonus will look like min(source_contribution_bonus + test_and_non_code_contribution_bonus, MAX_CONTRIBUTION_BONUS)
test_and_non_code_contribution_bonus is capped at 5% (?) of initial_base_score + source_contribution_bonus, where initial_base_score is MERGED_PR_BASE_SCORE * source.density.
This way you will still get more score for more tests, but it won't grow until MAX_CONTRIBUTION_BONUS cap, like source changes do.

anderdc · 2026-04-13T00:24:00Z

My intention with the current state of scoring is to reward targeted PRs, in general yes that's generally smaller things, and the smaller/more concise the better. Perhaps yes, it should be dialed back a tad to target medium sized PRs (my favorite sweet spot), but that's a conversation for another thread.

In the meantime

yes, change so no density required on non_code and test contributions, just do source code density capped at the constants.py value
let's NOT cap the max contribution bonus for tests, not saying I like or dislike it, need to chew on that type of change + if we did it it should be it's own PR. Although what you said is correct, me personally as a maintainer, I will not allow disproportionate testing PRs, this one is even pushing it a bit. I want to get this PR in already.

Implement that and we should be good to merge pending final checks

seroperson force-pushed the i339-tests-separate-scoring branch 2 times, most recently from d9ff4cd to d8a9df5 Compare April 7, 2026 08:50

seroperson force-pushed the i339-tests-separate-scoring branch from d8a9df5 to 4ade5b6 Compare April 9, 2026 23:23

anderdc requested changes Apr 10, 2026

View reviewed changes

anderdc force-pushed the test branch 2 times, most recently from 8adcdf6 to 9c9860f Compare April 10, 2026 17:07

seroperson force-pushed the i339-tests-separate-scoring branch from 639d9d5 to 60b3b9c Compare April 11, 2026 01:33

Scoring tests separately

8868d19

seroperson force-pushed the i339-tests-separate-scoring branch from 60b3b9c to 8868d19 Compare April 11, 2026 03:10

seroperson requested a review from anderdc April 11, 2026 03:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scoring tests separately#350

Scoring tests separately#350
seroperson wants to merge 1 commit intoentrius:testfrom
seroperson:i339-tests-separate-scoring

seroperson commented Apr 7, 2026

Uh oh!

anderdc commented Apr 7, 2026

Uh oh!

seroperson commented Apr 9, 2026

Uh oh!

anderdc commented Apr 10, 2026

Uh oh!

anderdc left a comment •

edited

Loading

Uh oh!

seroperson commented Apr 11, 2026

Uh oh!

anderdc commented Apr 12, 2026 •

edited

Loading

Uh oh!

seroperson commented Apr 12, 2026 •

edited

Loading

Uh oh!

anderdc commented Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

seroperson commented Apr 7, 2026

Before

After

Implementation details

Test cases

Uh oh!

anderdc commented Apr 7, 2026

Uh oh!

seroperson commented Apr 9, 2026

Uh oh!

anderdc commented Apr 10, 2026

Uh oh!

anderdc left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Review

1. Don't loop twice over file results

2. Extend PrScoringResult instead of adding PrScoringResultCategorized

3. Keep scoring logic in scoring.py

4. Threshold check may not isolate categories correctly

5. Preserve diagnostic detail in log output

Uh oh!

seroperson commented Apr 11, 2026

Uh oh!

anderdc commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seroperson commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anderdc commented Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

anderdc left a comment •

edited

Loading

2. Extend `PrScoringResult` instead of adding `PrScoringResultCategorized`

3. Keep scoring logic in `scoring.py`

anderdc commented Apr 12, 2026 •

edited

Loading

seroperson commented Apr 12, 2026 •

edited

Loading