STT Evaluation: Automated metric #639

Merged

AkhileshNegi merged 9 commits into main from enhancement/stt-automated-metric on Mar 6, 2026

Conversation


@AkhileshNegi AkhileshNegi commented Mar 5, 2026

Summary

Target issue is #555
Explain the motivation for making this change. What existing problem does the pull request solve?

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

Add any other information required for the reviewer here.

Summary by CodeRabbit

  • New Features

    • Automatic background metric computation triggered after STT evaluations when any batch succeeds.
    • Per-sample metrics: WER, CER, WIP, and lenient WER.
    • Language-aware text normalization (Indic languages, Assamese, English) to improve metric accuracy (a normalization sketch follows this list).
    • Run-level aggregate scores (mean, standard deviation, counts).
  • Tests

    • Comprehensive unit tests for metric computation, normalization, edge cases, and aggregate scoring.
  • Chores

    • Added runtime dependencies for metric calculation/normalization; minor docstring and import cleanups.
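
To ground the normalization bullet above: a minimal sketch using the two libraries this PR adds as dependencies (indic-nlp-library and whisper-normalizer). The routing helper normalize_text and its language-code handling are illustrative assumptions, not the repository's actual implementation.

# Hypothetical routing helper; normalize_text and the language-code
# mapping are assumptions for illustration, not repo code.
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from whisper_normalizer.english import EnglishTextNormalizer

_english_normalizer = EnglishTextNormalizer()
_indic_factory = IndicNormalizerFactory()

def normalize_text(text: str, lang_code: str) -> str:
    """Route text to an English or Indic normalizer before scoring."""
    if lang_code == "en":
        # Whisper-style English normalization (casing, punctuation,
        # number and spelling-variant handling)
        return _english_normalizer(text)
    # Indic scripts (e.g. "hi", "bn", "as" for Assamese) get
    # script-aware Unicode normalization of character variants
    return _indic_factory.get_normalizer(lang_code).normalize(text)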

@AkhileshNegi AkhileshNegi self-assigned this Mar 5, 2026

coderabbitai Bot commented Mar 5, 2026

📝 Walkthrough

Adds language-aware STT metric computation modules and triggers a low-priority Celery job from the STT run poller to compute per-sample scores and run-level aggregates when any batch succeeded.

Changes

  • Metric computation modules (backend/app/services/stt_evaluations/metrics.py, backend/app/services/stt_evaluations/metric_job.py): metrics.py implements normalization and per-sample metrics (WER, CER, WIP, lenient WER) plus run aggregation; metric_job.py implements execute_metric_computation to fetch successful STTResult rows, compute scores, bulk-update results, and update run aggregates. (A scoring sketch follows this list.)
  • Task dispatch (backend/app/crud/stt_evaluations/cron.py): the poller now conditionally dispatches a low-priority Celery metric-computation task (with run_id and context) when any batch succeeded; it logs the dispatch and handles dispatch-time exceptions. (Sketched after the sequence diagram below.)
  • Dependencies (backend/pyproject.toml): adds runtime dependencies jiwer>=3.1.0, indic-nlp-library>=0.92, and whisper-normalizer>=0.1.12.
  • Tests (backend/app/tests/services/stt_evaluations/test_metric_job.py, backend/app/tests/services/stt_evaluations/test_metrics.py): new unit tests covering normalization, per-sample metric calculations, aggregation, job execution paths (scored/skipped/failed), and DB interaction mocks.
  • Minor cleanup (backend/app/models/llm/response.py, backend/app/services/llm/providers/registry.py): small docstring typo fix and removal of unused imports; no functional changes.
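
To make the per-sample scoring concrete, here is a hedged sketch built on jiwer (added to the dependencies in this PR). The function name calculate_stt_metrics follows the walkthrough, but the metric keys and the regex-based lenient normalizer are illustrative assumptions; the repository derives its lenient WER from the language-aware normalization described above.

# Sketch of per-sample scoring with jiwer; metric keys and the crude
# lenient normalizer are assumptions, not the repository's exact code.
import re

import jiwer

def _lenient(text: str) -> str:
    """Crude stand-in for the repo's language-aware normalization."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

def calculate_stt_metrics(hypothesis: str, reference: str) -> dict[str, float]:
    """Compute WER, CER, WIP, and a lenient WER for one sample."""
    return {
        "wer": jiwer.wer(reference, hypothesis),
        "cer": jiwer.cer(reference, hypothesis),
        "wip": jiwer.wip(reference, hypothesis),  # word information preserved
        "lenient_wer": jiwer.wer(_lenient(reference), _lenient(hypothesis)),
    }

For example, scoring hypothesis "the cat sat" against reference "the cat sat." gives a strict WER of 1/3 (the trailing period makes "sat." a different token) but a lenient WER of 0, since punctuation is stripped before comparison.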

Sequence Diagram

sequenceDiagram
    participant Cron as STT Cron
    participant Celery as Celery Task
    participant DB as Database
    participant Metrics as Metrics Service

    rect rgba(200, 230, 201, 0.5)
    Cron->>Cron: Poll and finalize STT run
    end

    alt Any batch succeeded
        rect rgba(187, 222, 251, 0.5)
        Cron->>Celery: Dispatch low-priority metric task (run_id, context)
        Celery->>DB: Fetch STTResult records (status=SUCCESS, run_id)
        DB-->>Celery: STTResult records
        Celery->>DB: Batch-fetch STTSample & Language records
        DB-->>Celery: Samples & Languages
        Celery->>Metrics: calculate_stt_metrics(hyp, ref, lang) per result
        Metrics-->>Celery: Per-sample scores
        Celery->>DB: Bulk-update STTResult with scores
        Celery->>Metrics: compute_run_aggregate_scores(result_scores)
        Metrics-->>Celery: Run-level aggregates
        Celery->>DB: Update STTRun with aggregate scores
        DB-->>Celery: Updated run
        end
    end
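
The dispatch step at the top of the diagram can be sketched as follows. The task name compute_stt_metrics and the priority value are illustrative assumptions; only the behavior (a low-priority dispatch carrying run_id and context, with dispatch errors logged rather than raised) comes from the walkthrough.

# Hypothetical dispatch from the poller; the task name and priority
# value are illustrative, not taken from this repository.
import logging

from celery import Celery

logger = logging.getLogger(__name__)
celery_app = Celery("worker", broker="redis://localhost:6379/0")

def dispatch_metric_computation(run_id: int, project_id: int, organization_id: int) -> None:
    """Fire-and-forget metric job; a dispatch failure must not break polling."""
    try:
        celery_app.send_task(
            "compute_stt_metrics",
            kwargs={
                "run_id": run_id,
                "project_id": project_id,
                "organization_id": organization_id,
            },
            priority=0,  # priority semantics are broker-dependent; the intent is "low"
        )
        logger.info("Dispatched metric computation for run %s", run_id)
    except Exception:
        logger.exception("Failed to dispatch metric computation for run %s", run_id)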
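
Similarly, the compute_run_aggregate_scores step near the bottom of the diagram reduces per-sample scores to run-level statistics. A minimal sketch, assuming each per-sample score is a flat dict of floats (the actual field names and payload shape in the repo may differ):

# Minimal aggregation sketch; the flat dict shape and derived key names
# ("<metric>_mean", "<metric>_std") are assumptions about the score payload.
from statistics import mean, stdev

def compute_run_aggregate_scores(all_scores: list[dict[str, float]]) -> dict[str, float]:
    """Reduce per-sample score dicts to run-level mean/std plus a sample count."""
    aggregate: dict[str, float] = {"count": float(len(all_scores))}
    if not all_scores:
        return aggregate
    for metric in all_scores[0]:
        values = [s[metric] for s in all_scores if metric in s]
        aggregate[f"{metric}_mean"] = mean(values)
        # stdev needs at least two data points; report 0.0 for a single sample
        aggregate[f"{metric}_std"] = stdev(values) if len(values) > 1 else 0.0
    return aggregate

For two samples with WER 0.2 and 0.4, this yields wer_mean = 0.3 and wer_std ≈ 0.141.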

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Evaluation: STT #571 — Extends poll_stt_run flow; relevant because this change dispatches a low-priority metric computation task from the same polling path.

Suggested reviewers

  • kartpop
  • Prajna1999

Poem

🐰 I hopped through runs at break of dawn,

Counting whispers, syllables drawn,
WER and CER, tiny and bright,
Batch by batch I set them right,
Celery hums — metrics sing tonight.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 69.23%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title Check: ❓ Inconclusive. The title 'STT Evaluation: Automated metric' is incomplete and vague, lacking specificity about what automated metric functionality was added or modified. Resolution: revise the title to be more specific, such as 'Add automated metric computation for STT evaluations' or 'Implement async metric calculation for STT results'.
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@AkhileshNegi AkhileshNegi added the enhancement New feature or request label Mar 5, 2026

codecov Bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 97.84173% with 9 lines in your changes missing coverage. Please review.

Files with missing lines:
  • backend/app/crud/stt_evaluations/cron.py: 14.28% patch coverage, 6 lines missing ⚠️
  • backend/app/services/stt_evaluations/metric_job.py: 95.31% patch coverage, 3 lines missing ⚠️


@AkhileshNegi AkhileshNegi linked an issue Mar 5, 2026 that may be closed by this pull request
@AkhileshNegi AkhileshNegi marked this pull request as ready for review March 6, 2026 05:31

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (3)
backend/app/services/stt_evaluations/metrics.py (1)

14-14: Import Callable from collections.abc instead of typing.

The static analysis correctly flags this: typing.Callable has been a deprecated alias of collections.abc.Callable since Python 3.9, and the project targets Python 3.12+, so importing from collections.abc is the recommended approach.

♻️ Proposed fix
-from typing import Any, Callable
+from collections.abc import Callable
+from typing import Any
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/stt_evaluations/metrics.py` at line 14, The import of
Callable in metrics.py should come from collections.abc for Python 3.9+
compatibility: update the imports so Callable is imported from collections.abc
(keep Any from typing if still used) and remove Callable from typing; ensure any
references to Callable in functions or type hints (e.g., in this module's
metrics-related functions/classes) remain unchanged.
backend/app/services/stt_evaluations/metric_job.py (2)

128-140: Consider wrapping updates in a single transaction.

Currently, line 131 commits the bulk result updates, and lines 136-140 then update the run separately. If the run update fails, individual result scores will be committed but run-level aggregates won't be stored. This inconsistency may be acceptable, but wrapping both in a single commit would ensure atomicity.

💡 Optional: Move commit after aggregate update
             # Bulk update all result scores
             if score_updates:
                 session.bulk_update_mappings(STTResult, score_updates)
-                session.commit()

             # Compute and store run-level aggregate scores
             if all_scores:
                 aggregate = compute_run_aggregate_scores(all_scores)
                 update_stt_run(
                     session=session,
                     run_id=run_id,
                     score=aggregate,
                 )
+
+            # Commit all updates together
+            if score_updates or all_scores:
+                session.commit()

Note: This would require modifying update_stt_run to not commit internally, which may have broader implications.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/stt_evaluations/metric_job.py` around lines 128 - 140,
The bulk result updates (session.bulk_update_mappings on STTResult) are
committed separately from the run-level update (update_stt_run), risking partial
commits; wrap both the bulk updates and the compute_run_aggregate_scores ->
update_stt_run call in a single transaction so they commit atomically (move
session.commit to after update_stt_run), and adjust update_stt_run to accept/use
the provided session without committing internally so the outer transaction
controls the commit for run_id updates.

22-30: Consider documenting the unused parameters.

The static analysis flags job_id, task_instance, organization_id, and kwargs as unused. However, these are required by the Celery job execution framework (as shown in _execute_job_internal, which passes them). Consider adding a brief comment or using the _ prefix convention for the truly unused ones, while keeping task_instance available for potential progress reporting.

💡 Optional: Add underscore prefix or comment
 def execute_metric_computation(
     project_id: int,
-    job_id: str,
+    job_id: str,  # noqa: ARG001 - Required by Celery job framework
     task_id: str,
-    task_instance: Any,
-    organization_id: int,
+    task_instance: Any,  # noqa: ARG001 - Available for progress updates
+    organization_id: int,  # noqa: ARG001 - Required by Celery job framework
     run_id: int,
-    **kwargs: Any,
+    **kwargs: Any,  # noqa: ARG001 - Reserved for future extensibility
 ) -> dict[str, Any]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/stt_evaluations/metric_job.py` around lines 22 - 30, The
parameters job_id, organization_id, and kwargs are flagged as unused but are
required by the Celery executor; rename them with a leading underscore (e.g.,
_job_id, _organization_id, **_kwargs) in the execute_metric_computation
signature to indicate they are intentionally unused, keep task_instance named
(and add a brief comment above the function like "# task_instance kept for
possible progress reporting; other params required by Celery") so
reviewers/static analysis know why they exist.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 79c6af55-2862-419a-b7ba-fafbfa6f26d6

📥 Commits

Reviewing files that changed from the base of the PR and between ab46e3d and 84121d6.

⛔ Files ignored due to path filters (1)
  • backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • backend/app/crud/stt_evaluations/cron.py
  • backend/app/services/stt_evaluations/metric_job.py
  • backend/app/services/stt_evaluations/metrics.py
  • backend/pyproject.toml


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
backend/app/services/stt_evaluations/metric_job.py (1)

129-141: Non-atomic commits between result scores and run aggregate.

The result scores are committed at line 132, then the run aggregate is updated separately at lines 137-141. If the aggregate update fails after scores are committed, the run will lack aggregate scores while individual results have their scores set.

This is likely acceptable for a background job (re-running would be idempotent), but consider moving the commit after the aggregate update for atomicity:

♻️ Optional refactor for atomicity
             # Bulk update all result scores
             if score_updates:
                 session.execute(update(STTResult), score_updates)
-                session.commit()

             # Compute and store run-level aggregate scores
             if all_scores:
                 aggregate = compute_run_aggregate_scores(all_scores)
                 update_stt_run(
                     session=session,
                     run_id=run_id,
                     score=aggregate,
                 )
+
+            # Commit all changes atomically
+            if score_updates or all_scores:
+                session.commit()

Note: This would require modifying update_stt_run to not commit internally, which may have wider implications.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/stt_evaluations/metric_job.py` around lines 129 - 141,
The commit of individual STTResult scores is done before computing/updating the
run aggregate, risking partial state if update_stt_run fails; instead, apply the
bulk update (session.execute(update(STTResult), score_updates)) but defer
session.commit() until after compute_run_aggregate_scores(all_scores) and
calling update_stt_run(...). Ensure update_stt_run accepts the existing session
and does not commit internally (or add a new parameter to disable its internal
commit) so the single session.commit() after both updates makes the operation
atomic.
backend/app/services/stt_evaluations/metrics.py (1)

14-14: Import Callable from collections.abc instead of typing.

Per static analysis and Python best practices, Callable should be imported from collections.abc rather than typing.

♻️ Suggested fix
-from typing import Any, Callable
+from collections.abc import Callable
+from typing import Any
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/stt_evaluations/metrics.py` at line 14, Replace the
current import of Callable from typing with the collections.abc variant to
ensure Python 3.9+ compatibility: change the import statement that currently
reads "from typing import Any, Callable" to import Any from typing (or typing as
needed) and import Callable from collections.abc; update the import line(s) in
metrics.py where "Callable" is referenced so existing type hints (e.g., any
functions or variables using Callable) continue to resolve correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 87aa10fd-e77b-42c3-8b95-60b76ec30adf

📥 Commits

Reviewing files that changed from the base of the PR and between 84121d6 and f18d1ce.

📒 Files selected for processing (2)
  • backend/app/services/stt_evaluations/metric_job.py
  • backend/app/services/stt_evaluations/metrics.py

@AkhileshNegi AkhileshNegi requested a review from vprashrex March 6, 2026 06:29

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
backend/app/tests/services/stt_evaluations/test_metric_job.py (2)

48-55: Return fresh kwargs per test instead of sharing a module-level mock.

BASE_KWARGS embeds a single MagicMock() instance for task_instance. If execute_metric_computation() touches that object, state and call history can leak across tests. A small factory keeps each case isolated.

Proposed refactor
-BASE_KWARGS: dict[str, Any] = {
-    "project_id": 1,
-    "job_id": "10",
-    "task_id": "celery-task-123",
-    "task_instance": MagicMock(),
-    "organization_id": 1,
-    "run_id": 10,
-}
+def _make_base_kwargs() -> dict[str, Any]:
+    return {
+        "project_id": 1,
+        "job_id": "10",
+        "task_id": "celery-task-123",
+        "task_instance": MagicMock(),
+        "organization_id": 1,
+        "run_id": 10,
+    }

Then switch call sites to execute_metric_computation(**_make_base_kwargs()). As per coding guidelines, backend/app/tests/**/*.py: "Use factory pattern for test fixtures in backend/app/tests/".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/services/stt_evaluations/test_metric_job.py` around lines
48 - 55, BASE_KWARGS currently holds a shared MagicMock() under "task_instance"
which can leak state between tests; replace this pattern by creating a factory
function (e.g., _make_base_kwargs) that returns a fresh dict with a new
MagicMock() for "task_instance" each call, update all test call sites to use
execute_metric_computation(**_make_base_kwargs()) instead of BASE_KWARGS, and
remove or stop using the module-level BASE_KWARGS to ensure isolation in tests
that call execute_metric_computation.

65-67: Add type hints to the injected patch mocks.

These test methods annotate -> None but leave the injected mock parameters untyped. Annotating them as MagicMock keeps the file aligned with the repo's typing rule.

Example
     def test_no_results_returns_zero_counts(
-        self, mock_session_cls, mock_now, mock_calc, mock_update_run
+        self,
+        mock_session_cls: MagicMock,
+        mock_now: MagicMock,
+        mock_calc: MagicMock,
+        mock_update_run: MagicMock,
     ) -> None:

Apply the same pattern to the other patched test methods in this class. As per coding guidelines, "Always add type hints to all function parameters and return values in Python code" and "Use Python 3.11+ with type hints throughout the codebase".

Also applies to: 79-81, 120-122, 141-143, 163-165, 189-191, 223-225

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/services/stt_evaluations/test_metric_job.py` around lines
65 - 67, Update the test method signatures to add explicit type hints for the
injected patch mocks (use unittest.mock.MagicMock) so parameters like
mock_session_cls, mock_now, mock_calc, mock_update_run (and the other patched
params in methods test_no_results_returns_zero_counts, plus the groups starting
at lines referenced: the methods around lines 79-81, 120-122, 141-143, 163-165,
189-191, 223-225) are annotated as MagicMock; import MagicMock from
unittest.mock if not present and apply the same pattern to all patched test
methods in this test class so every injected mock parameter has a MagicMock
type hint and return annotations remain -> None.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9b874bfe-9479-44f3-88ec-efe4c31beac0

📥 Commits

Reviewing files that changed from the base of the PR and between f18d1ce and 52e5444.

📒 Files selected for processing (2)
  • backend/app/tests/services/stt_evaluations/test_metric_job.py
  • backend/app/tests/services/stt_evaluations/test_metrics.py

@AkhileshNegi AkhileshNegi requested a review from Prajna1999 March 6, 2026 07:08
@AkhileshNegi AkhileshNegi merged commit 20bacbd into main Mar 6, 2026
3 checks passed
@AkhileshNegi AkhileshNegi deleted the enhancement/stt-automated-metric branch March 6, 2026 10:26
nishika26 pushed a commit that referenced this pull request Mar 10, 2026

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

Evaluation: Automated metrics TTS/STT evals
