STT Evaluation: Automated metric #639
Conversation
📝 Walkthrough

Adds language-aware STT metric computation modules and triggers a low-priority Celery job from the STT run poller to compute per-sample scores and run-level aggregates when any batch succeeded.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Cron as STT Cron
    participant Celery as Celery Task
    participant DB as Database
    participant Metrics as Metrics Service

    rect rgba(200, 230, 201, 0.5)
        Cron->>Cron: Poll and finalize STT run
    end

    alt Any batch succeeded
        rect rgba(187, 222, 251, 0.5)
            Cron->>Celery: Dispatch low-priority metric task (run_id, context)
            Celery->>DB: Fetch STTResult records (status=SUCCESS, run_id)
            DB-->>Celery: STTResult records
            Celery->>DB: Batch-fetch STTSample & Language records
            DB-->>Celery: Samples & Languages
            Celery->>Metrics: calculate_stt_metrics(hyp, ref, lang) per result
            Metrics-->>Celery: Per-sample scores
            Celery->>DB: Bulk-update STTResult with scores
            Celery->>Metrics: compute_run_aggregate_scores(result_scores)
            Metrics-->>Celery: Run-level aggregates
            Celery->>DB: Update STTRun with aggregate scores
            DB-->>Celery: Updated run
        end
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: 1 passed, 2 failed (1 warning, 1 inconclusive)
Codecov Report: ❌ Patch coverage is
🧹 Nitpick comments (3)
backend/app/services/stt_evaluations/metrics.py (1)
14-14: Import `Callable` from `collections.abc` for Python 3.9+ compatibility.

The static analysis correctly flags this. Since the project uses Python 3.12+, importing from `collections.abc` is the recommended approach.

♻️ Proposed fix

```diff
-from typing import Any, Callable
+from collections.abc import Callable
+from typing import Any
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/stt_evaluations/metrics.py` at line 14: The import of Callable in metrics.py should come from collections.abc for Python 3.9+ compatibility: update the imports so Callable is imported from collections.abc (keep Any from typing if still used) and remove Callable from typing; ensure any references to Callable in functions or type hints (e.g., in this module's metrics-related functions/classes) remain unchanged.
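For context on the signatures this import serves, here is a self-contained sketch of the kind of `Callable`-typed metric function `calculate_stt_metrics(hyp, ref, lang)` could dispatch to; the names and registry below are illustrative assumptions, not the module's actual API:

```python
from collections.abc import Callable

# (hypothesis, reference) -> score; hypothetical alias for illustration
MetricFn = Callable[[str, str], float]


def _edit_distance(a: list[str], b: list[str]) -> int:
    # Classic dynamic-programming Levenshtein distance over token lists.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def word_error_rate(hyp: str, ref: str) -> float:
    # WER = word-level edit distance / reference length (floored at 1).
    ref_words = ref.split()
    return _edit_distance(hyp.split(), ref_words) / max(len(ref_words), 1)


METRICS: dict[str, MetricFn] = {"wer": word_error_rate}  # hypothetical registry
```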
backend/app/services/stt_evaluations/metric_job.py (2)

128-140: Consider wrapping updates in a single transaction.

Currently, line 131 commits the bulk result updates, then lines 136-140 update the run separately. If the run update fails, individual result scores will be committed but run-level aggregates won't be stored. This inconsistency may be acceptable, but wrapping both in a single commit would ensure atomicity.

💡 Optional: Move commit after aggregate update

```diff
 # Bulk update all result scores
 if score_updates:
     session.bulk_update_mappings(STTResult, score_updates)
-    session.commit()

 # Compute and store run-level aggregate scores
 if all_scores:
     aggregate = compute_run_aggregate_scores(all_scores)
     update_stt_run(
         session=session,
         run_id=run_id,
         score=aggregate,
     )
+
+# Commit all updates together
+if score_updates or all_scores:
+    session.commit()
```

Note: This would require modifying `update_stt_run` to not commit internally, which may have broader implications.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/stt_evaluations/metric_job.py` around lines 128 - 140, The bulk result updates (session.bulk_update_mappings on STTResult) are committed separately from the run-level update (update_stt_run), risking partial commits; wrap both the bulk updates and the compute_run_aggregate_scores -> update_stt_run call in a single transaction so they commit atomically (move session.commit to after update_stt_run), and adjust update_stt_run to accept/use the provided session without committing internally so the outer transaction controls the commit for run_id updates.
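A sketch of the shape this suggestion implies, with an assumed `commit` flag (not in the PR as written) so the caller's transaction stays in control:

```python
from sqlalchemy.orm import Session


def update_stt_run(
    session: Session,
    run_id: int,
    score: dict[str, float],
    commit: bool = False,  # assumed flag; lets the job commit once, atomically
) -> None:
    run = session.get(STTRun, run_id)  # STTRun: the PR's run model, assumed here
    run.score = score
    if commit:
        session.commit()
```

The metric job would then call this with `commit=False` and issue a single `session.commit()` after both the bulk result update and the run update.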
22-30: Consider documenting the unused parameters.

The static analysis flags `job_id`, `task_instance`, `organization_id`, and `kwargs` as unused. However, these are required by the Celery job execution framework (as shown in `_execute_job_internal`, which passes them). Consider adding a brief comment or using the `_` prefix convention for the truly unused ones while keeping `task_instance` available for potential progress reporting.

💡 Optional: Add underscore prefix or comment

```diff
 def execute_metric_computation(
     project_id: int,
-    job_id: str,
+    job_id: str,  # noqa: ARG001 - Required by Celery job framework
     task_id: str,
-    task_instance: Any,
-    organization_id: int,
+    task_instance: Any,  # noqa: ARG001 - Available for progress updates
+    organization_id: int,  # noqa: ARG001 - Required by Celery job framework
     run_id: int,
-    **kwargs: Any,
+    **kwargs: Any,  # noqa: ARG001 - Reserved for future extensibility
 ) -> dict[str, Any]:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/stt_evaluations/metric_job.py` around lines 22 - 30, The parameters job_id, organization_id, and kwargs are flagged as unused but are required by the Celery executor; rename them with a leading underscore (e.g., _job_id, _organization_id, **_kwargs) in the execute_metric_computation signature to indicate they are intentionally unused, keep task_instance named (and add a brief comment above the function like "# task_instance kept for possible progress reporting; other params required by Celery") so reviewers/static analysis know why they exist.
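The underscore rename this prompt describes would look like the sketch below. One caveat worth checking (an assumption on my part, not something the review verified): if the Celery wrapper invokes this function with keyword arguments such as `job_id=...`, renaming a parameter also changes its keyword, so the call site would need the same rename.

```python
from typing import Any


def execute_metric_computation(
    project_id: int,
    _job_id: str,  # intentionally unused; required by the job framework
    task_id: str,
    task_instance: Any,  # kept for possible progress reporting
    _organization_id: int,  # intentionally unused
    run_id: int,
    **_kwargs: Any,  # reserved for future extensibility
) -> dict[str, Any]:
    ...
```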
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/app/services/stt_evaluations/metric_job.py`:
- Around line 128-140: The bulk result updates (session.bulk_update_mappings on
STTResult) are committed separately from the run-level update (update_stt_run),
risking partial commits; wrap both the bulk updates and the
compute_run_aggregate_scores -> update_stt_run call in a single transaction so
they commit atomically (move session.commit to after update_stt_run), and adjust
update_stt_run to accept/use the provided session without committing internally
so the outer transaction controls the commit for run_id updates.
- Around line 22-30: The parameters job_id, organization_id, and kwargs are
flagged as unused but are required by the Celery executor; rename them with a
leading underscore (e.g., _job_id, _organization_id, **_kwargs) in the
execute_metric_computation signature to indicate they are intentionally unused,
keep task_instance named (and add a brief comment above the function like "#
task_instance kept for possible progress reporting; other params required by
Celery") so reviewers/static analysis know why they exist.
In `@backend/app/services/stt_evaluations/metrics.py`:
- Line 14: The import of Callable in metrics.py should come from collections.abc
for Python 3.9+ compatibility: update the imports so Callable is imported from
collections.abc (keep Any from typing if still used) and remove Callable from
typing; ensure any references to Callable in functions or type hints (e.g., in
this module's metrics-related functions/classes) remain unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 79c6af55-2862-419a-b7ba-fafbfa6f26d6
⛔ Files ignored due to path filters (1)
`backend/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
- `backend/app/crud/stt_evaluations/cron.py`
- `backend/app/services/stt_evaluations/metric_job.py`
- `backend/app/services/stt_evaluations/metrics.py`
- `backend/pyproject.toml`
🧹 Nitpick comments (2)
backend/app/services/stt_evaluations/metric_job.py (1)
129-141: Non-atomic commits between result scores and run aggregate.

The result scores are committed at line 132, then the run aggregate is updated separately at lines 137-141. If the aggregate update fails after scores are committed, the run will lack aggregate scores while individual results have their scores set.
This is likely acceptable for a background job (re-running would be idempotent), but consider moving the commit after the aggregate update for atomicity:
♻️ Optional refactor for atomicity
```diff
 # Bulk update all result scores
 if score_updates:
     session.execute(update(STTResult), score_updates)
-    session.commit()

 # Compute and store run-level aggregate scores
 if all_scores:
     aggregate = compute_run_aggregate_scores(all_scores)
     update_stt_run(
         session=session,
         run_id=run_id,
         score=aggregate,
     )
+
+# Commit all changes atomically
+if score_updates or all_scores:
+    session.commit()
```

Note: This would require modifying `update_stt_run` to not commit internally, which may have wider implications.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/stt_evaluations/metric_job.py` around lines 129-141: The commit of individual STTResult scores is done before computing/updating the run aggregate, risking partial state if update_stt_run fails; instead, apply the bulk update (session.execute(update(STTResult), score_updates)) but defer session.commit() until after compute_run_aggregate_scores(all_scores) and calling update_stt_run(...). Ensure update_stt_run accepts the existing session and does not commit internally (or add a new parameter to disable its internal commit) so the single session.commit() after both updates makes the operation atomic.
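For reference, a hedged sketch of what a run-level aggregation like `compute_run_aggregate_scores(result_scores)` could do, assuming per-sample scores are dicts of metric name to value (the PR's actual schema and behavior may differ):

```python
def compute_run_aggregate_scores(
    result_scores: list[dict[str, float]],
) -> dict[str, float]:
    # Mean of each metric across all per-sample score dicts.
    totals: dict[str, float] = {}
    counts: dict[str, int] = {}
    for scores in result_scores:
        for name, value in scores.items():
            totals[name] = totals.get(name, 0.0) + value
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}
```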
backend/app/services/stt_evaluations/metrics.py (1)

14-14: Import `Callable` from `collections.abc` for Python 3.9+ compatibility.

Per static analysis and Python best practices, `Callable` should be imported from `collections.abc` rather than `typing`.

♻️ Suggested fix

```diff
-from typing import Any, Callable
+from collections.abc import Callable
+from typing import Any
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/stt_evaluations/metrics.py` at line 14, Replace the current import of Callable from typing with the collections.abc variant to ensure Python 3.9+ compatibility: change the import statement that currently reads "from typing import Any, Callable" to import Any from typing (or typing as needed) and import Callable from collections.abc; update the import line(s) in metrics.py where "Callable" is referenced so existing type hints (e.g., any functions or variables using Callable) continue to resolve correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/app/services/stt_evaluations/metric_job.py`:
- Around line 129-141: The commit of individual STTResult scores is done before
computing/updating the run aggregate, risking partial state if update_stt_run
fails; instead, apply the bulk update (session.execute(update(STTResult),
score_updates)) but defer session.commit() until after
compute_run_aggregate_scores(all_scores) and calling update_stt_run(...). Ensure
update_stt_run accepts the existing session and does not commit internally (or
add a new parameter to disable its internal commit) so the single
session.commit() after both updates makes the operation atomic.
In `@backend/app/services/stt_evaluations/metrics.py`:
- Line 14: Replace the current import of Callable from typing with the
collections.abc variant to ensure Python 3.9+ compatibility: change the import
statement that currently reads "from typing import Any, Callable" to import Any
from typing (or typing as needed) and import Callable from collections.abc;
update the import line(s) in metrics.py where "Callable" is referenced so
existing type hints (e.g., any functions or variables using Callable) continue
to resolve correctly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 87aa10fd-e77b-42c3-8b95-60b76ec30adf
📒 Files selected for processing (2)
- `backend/app/services/stt_evaluations/metric_job.py`
- `backend/app/services/stt_evaluations/metrics.py`
🧹 Nitpick comments (2)
backend/app/tests/services/stt_evaluations/test_metric_job.py (2)
48-55: Return fresh kwargs per test instead of sharing a module-level mock.

`BASE_KWARGS` embeds a single `MagicMock()` instance for `task_instance`. If `execute_metric_computation()` touches that object, state and call history can leak across tests. A small factory keeps each case isolated.

Proposed refactor

```diff
-BASE_KWARGS: dict[str, Any] = {
-    "project_id": 1,
-    "job_id": "10",
-    "task_id": "celery-task-123",
-    "task_instance": MagicMock(),
-    "organization_id": 1,
-    "run_id": 10,
-}
+def _make_base_kwargs() -> dict[str, Any]:
+    return {
+        "project_id": 1,
+        "job_id": "10",
+        "task_id": "celery-task-123",
+        "task_instance": MagicMock(),
+        "organization_id": 1,
+        "run_id": 10,
+    }
```

Then switch call sites to `execute_metric_computation(**_make_base_kwargs())`. As per coding guidelines for `backend/app/tests/**/*.py`: "Use factory pattern for test fixtures in backend/app/tests/".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/services/stt_evaluations/test_metric_job.py` around lines 48 - 55, BASE_KWARGS currently holds a shared MagicMock() under "task_instance" which can leak state between tests; replace this pattern by creating a factory function (e.g., _make_base_kwargs) that returns a fresh dict with a new MagicMock() for "task_instance" each call, update all test call sites to use execute_metric_computation(**_make_base_kwargs()) instead of BASE_KWARGS, and remove or stop using the module-level BASE_KWARGS to ensure isolation in tests that call execute_metric_computation.
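A quick illustration of the isolation the factory buys (the test name here is hypothetical):

```python
def test_factory_returns_independent_mocks() -> None:
    first = _make_base_kwargs()
    second = _make_base_kwargs()
    # Each call builds a fresh MagicMock, so no state leaks between tests.
    assert first["task_instance"] is not second["task_instance"]
```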
65-67: Add type hints to the injected patch mocks.

These test methods annotate `-> None` but leave the injected mock parameters untyped. Annotating them as `MagicMock` keeps the file aligned with the repo's typing rule.

Example

```diff
 def test_no_results_returns_zero_counts(
-    self, mock_session_cls, mock_now, mock_calc, mock_update_run
+    self,
+    mock_session_cls: MagicMock,
+    mock_now: MagicMock,
+    mock_calc: MagicMock,
+    mock_update_run: MagicMock,
 ) -> None:
```

Apply the same pattern to the other patched test methods in this class. As per coding guidelines, "Always add type hints to all function parameters and return values in Python code" and "Use Python 3.11+ with type hints throughout the codebase".
Also applies to: 79-81, 120-122, 141-143, 163-165, 189-191, 223-225
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/services/stt_evaluations/test_metric_job.py` around lines 65-67: Update the test method signatures to add explicit type hints for the injected patch mocks (use unittest.mock.MagicMock) so parameters like mock_session_cls, mock_now, mock_calc, mock_update_run (and the other patched params in methods test_no_results_returns_zero_counts, plus the groups starting at lines referenced: the methods around lines 79-81, 120-122, 141-143, 163-165, 189-191, 223-225) are annotated as MagicMock; import MagicMock from unittest.mock if not present and apply the same pattern to all patched test methods in this test class so every injected mock parameter has a MagicMock type hint and return annotations remain -> None.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/app/tests/services/stt_evaluations/test_metric_job.py`:
- Around line 48-55: BASE_KWARGS currently holds a shared MagicMock() under
"task_instance" which can leak state between tests; replace this pattern by
creating a factory function (e.g., _make_base_kwargs) that returns a fresh dict
with a new MagicMock() for "task_instance" each call, update all test call sites
to use execute_metric_computation(**_make_base_kwargs()) instead of BASE_KWARGS,
and remove or stop using the module-level BASE_KWARGS to ensure isolation in
tests that call execute_metric_computation.
- Around line 65-67: Update the test method signatures to add explicit type
hints for the injected patch mocks (use unittest.mock.MagicMock) so parameters like
mock_session_cls, mock_now, mock_calc, mock_update_run (and the other patched
params in methods test_no_results_returns_zero_counts, plus the groups starting
at lines referenced: the methods around lines 79-81, 120-122, 141-143, 163-165,
189-191, 223-225) are annotated as MagicMock; import MagicMock from unittest.mock if
not present and apply the same pattern to all patched test methods in this test
class so every injected mock parameter has a MagicMock type hint and return
annotations remain -> None.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9b874bfe-9479-44f3-88ec-efe4c31beac0
📒 Files selected for processing (2)
- `backend/app/tests/services/stt_evaluations/test_metric_job.py`
- `backend/app/tests/services/stt_evaluations/test_metrics.py`
Summary
Target issue is #555
STT evaluation runs previously completed without automated quality scores. This PR computes per-sample, language-aware metrics and run-level aggregates in a low-priority background job once the run poller finalizes a run, so results are scored without manual work.
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
- Run `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.

Notes
Please add here if any other information is required for the reviewer.
Summary by CodeRabbit

New Features
- Language-aware STT metric computation: per-sample scores and run-level aggregates, computed by a low-priority background job after a run completes.

Tests
- Unit tests for the metric computation job and the metric functions.

Chores
- Backend dependency updates (`backend/pyproject.toml`, `backend/uv.lock`).