STT Evaluation: Automated metric #639

Merged

AkhileshNegi merged 9 commits into main from enhancement/stt-automated-metric on Mar 6, 2026

Conversation


@AkhileshNegi AkhileshNegi commented Mar 5, 2026

Summary

Target issue is #555
Explain the motivation for making this change. What existing problem does the pull request solve?

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

Add any other information required for the reviewer here.

Summary by CodeRabbit

  • New Features

    • Automatic background metric computation triggered after STT evaluations when any batch succeeds.
    • Per-sample metrics: WER, CER, WIP, and lenient WER.
    • Language-aware text normalization (Indic languages, Assamese, English) to improve metric accuracy (a normalization sketch follows this list).
    • Run-level aggregate scores (mean, standard deviation, counts).
  • Tests

    • Comprehensive unit tests for metric computation, normalization, edge cases, and aggregate scoring.
  • Chores

    • Added runtime dependencies for metric calculation/normalization; minor docstring and import cleanups.
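
To ground the normalization bullet above: a minimal sketch using the two libraries this PR adds as dependencies (indic-nlp-library and whisper-normalizer). The routing helper normalize_text and its language-code handling are illustrative assumptions, not the repository's actual implementation.

# Hypothetical routing helper; normalize_text and the language-code
# mapping are assumptions for illustration, not repo code.
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from whisper_normalizer.english import EnglishTextNormalizer

_english_normalizer = EnglishTextNormalizer()
_indic_factory = IndicNormalizerFactory()

def normalize_text(text: str, lang_code: str) -> str:
    """Route text to an English or Indic normalizer before scoring."""
    if lang_code == "en":
        # Whisper-style English normalization (casing, punctuation,
        # number and spelling-variant handling)
        return _english_normalizer(text)
    # Indic scripts (e.g. "hi", "bn", "as" for Assamese) get
    # script-aware Unicode normalization of character variants
    return _indic_factory.get_normalizer(lang_code).normalize(text)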

@AkhileshNegi AkhileshNegi self-assigned this Mar 5, 2026

coderabbitai Bot commented Mar 5, 2026

📝 Walkthrough

Adds language-aware STT metric computation modules and triggers a low-priority Celery job from the STT run poller to compute per-sample scores and run-level aggregates when any batch succeeded.

Changes

  • Metric computation modules (backend/app/services/stt_evaluations/metrics.py, backend/app/services/stt_evaluations/metric_job.py): metrics.py implements normalization and per-sample metrics (WER, CER, WIP, lenient WER) plus run aggregation; metric_job.py implements execute_metric_computation to fetch successful STTResult rows, compute scores, bulk-update results, and update run aggregates. (A scoring sketch follows this list.)
  • Task dispatch (backend/app/crud/stt_evaluations/cron.py): the poller now conditionally dispatches a low-priority Celery metric-computation task (with run_id and context) when any batch succeeded; it logs the dispatch and handles dispatch-time exceptions. (Sketched after the sequence diagram below.)
  • Dependencies (backend/pyproject.toml): adds runtime dependencies jiwer>=3.1.0, indic-nlp-library>=0.92, and whisper-normalizer>=0.1.12.
  • Tests (backend/app/tests/services/stt_evaluations/test_metric_job.py, backend/app/tests/services/stt_evaluations/test_metrics.py): new unit tests covering normalization, per-sample metric calculations, aggregation, job execution paths (scored/skipped/failed), and DB interaction mocks.
  • Minor cleanup (backend/app/models/llm/response.py, backend/app/services/llm/providers/registry.py): small docstring typo fix and removal of unused imports; no functional changes.
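
To make the per-sample scoring concrete, here is a hedged sketch built on jiwer (added to the dependencies in this PR). The function name calculate_stt_metrics follows the walkthrough, but the metric keys and the regex-based lenient normalizer are illustrative assumptions; the repository derives its lenient WER from the language-aware normalization described above.

# Sketch of per-sample scoring with jiwer; metric keys and the crude
# lenient normalizer are assumptions, not the repository's exact code.
import re

import jiwer

def _lenient(text: str) -> str:
    """Crude stand-in for the repo's language-aware normalization."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

def calculate_stt_metrics(hypothesis: str, reference: str) -> dict[str, float]:
    """Compute WER, CER, WIP, and a lenient WER for one sample."""
    return {
        "wer": jiwer.wer(reference, hypothesis),
        "cer": jiwer.cer(reference, hypothesis),
        "wip": jiwer.wip(reference, hypothesis),  # word information preserved
        "lenient_wer": jiwer.wer(_lenient(reference), _lenient(hypothesis)),
    }

For example, scoring hypothesis "the cat sat" against reference "the cat sat." gives a strict WER of 1/3 (the trailing period makes "sat." a different token) but a lenient WER of 0, since punctuation is stripped before comparison.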

Sequence Diagram

sequenceDiagram
    participant Cron as STT Cron
    participant Celery as Celery Task
    participant DB as Database
    participant Metrics as Metrics Service

    rect rgba(200, 230, 201, 0.5)
    Cron->>Cron: Poll and finalize STT run
    end

    alt Any batch succeeded
        rect rgba(187, 222, 251, 0.5)
        Cron->>Celery: Dispatch low-priority metric task (run_id, context)
        Celery->>DB: Fetch STTResult records (status=SUCCESS, run_id)
        DB-->>Celery: STTResult records
        Celery->>DB: Batch-fetch STTSample & Language records
        DB-->>Celery: Samples & Languages
        Celery->>Metrics: calculate_stt_metrics(hyp, ref, lang) per result
        Metrics-->>Celery: Per-sample scores
        Celery->>DB: Bulk-update STTResult with scores
        Celery->>Metrics: compute_run_aggregate_scores(result_scores)
        Metrics-->>Celery: Run-level aggregates
        Celery->>DB: Update STTRun with aggregate scores
        DB-->>Celery: Updated run
        end
    end
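
The dispatch step at the top of the diagram can be sketched as follows. The task name compute_stt_metrics and the priority value are illustrative assumptions; only the behavior (a low-priority dispatch carrying run_id and context, with dispatch errors logged rather than raised) comes from the walkthrough.

# Hypothetical dispatch from the poller; the task name and priority
# value are illustrative, not taken from this repository.
import logging

from celery import Celery

logger = logging.getLogger(__name__)
celery_app = Celery("worker", broker="redis://localhost:6379/0")

def dispatch_metric_computation(run_id: int, project_id: int, organization_id: int) -> None:
    """Fire-and-forget metric job; a dispatch failure must not break polling."""
    try:
        celery_app.send_task(
            "compute_stt_metrics",
            kwargs={
                "run_id": run_id,
                "project_id": project_id,
                "organization_id": organization_id,
            },
            priority=0,  # priority semantics are broker-dependent; the intent is "low"
        )
        logger.info("Dispatched metric computation for run %s", run_id)
    except Exception:
        logger.exception("Failed to dispatch metric computation for run %s", run_id)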
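
Similarly, the compute_run_aggregate_scores step near the bottom of the diagram reduces per-sample scores to run-level statistics. A minimal sketch, assuming each per-sample score is a flat dict of floats (the actual field names and payload shape in the repo may differ):

# Minimal aggregation sketch; the flat dict shape and derived key names
# ("<metric>_mean", "<metric>_std") are assumptions about the score payload.
from statistics import mean, stdev

def compute_run_aggregate_scores(all_scores: list[dict[str, float]]) -> dict[str, float]:
    """Reduce per-sample score dicts to run-level mean/std plus a sample count."""
    aggregate: dict[str, float] = {"count": float(len(all_scores))}
    if not all_scores:
        return aggregate
    for metric in all_scores[0]:
        values = [s[metric] for s in all_scores if metric in s]
        aggregate[f"{metric}_mean"] = mean(values)
        # stdev needs at least two data points; report 0.0 for a single sample
        aggregate[f"{metric}_std"] = stdev(values) if len(values) > 1 else 0.0
    return aggregate

For two samples with WER 0.2 and 0.4, this yields wer_mean = 0.3 and wer_std ≈ 0.141.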

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Evaluation: STT #571 — Extends poll_stt_run flow; relevant because this change dispatches a low-priority metric computation task from the same polling path.

Suggested reviewers

  • kartpop
  • Prajna1999

Poem

🐰 I hopped through runs at break of dawn,

Counting whispers, syllables drawn,
WER and CER, tiny and bright,
Batch by batch I set them right,
Celery hums — metrics sing tonight.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 69.23%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title Check: ❓ Inconclusive. The title 'STT Evaluation: Automated metric' is incomplete and vague, lacking specificity about what automated metric functionality was added or modified. Resolution: revise the title to be more specific, such as 'Add automated metric computation for STT evaluations' or 'Implement async metric calculation for STT results'.
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@AkhileshNegi AkhileshNegi added the enhancement New feature or request label Mar 5, 2026

codecov Bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 97.84173% with 9 lines in your changes missing coverage. Please review.

Files with missing lines:
  • backend/app/crud/stt_evaluations/cron.py: 14.28% patch coverage, 6 lines missing ⚠️
  • backend/app/services/stt_evaluations/metric_job.py: 95.31% patch coverage, 3 lines missing ⚠️


@AkhileshNegi AkhileshNegi linked an issue Mar 5, 2026 that may be closed by this pull request
@AkhileshNegi AkhileshNegi marked this pull request as ready for review March 6, 2026 05:31

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (3)
backend/app/services/stt_evaluations/metrics.py (1)

14-14: Import Callable from collections.abc instead of typing.

The static analysis correctly flags this: typing.Callable has been a deprecated alias of collections.abc.Callable since Python 3.9, and the project targets Python 3.12+, so importing from collections.abc is the recommended approach.

♻️ Proposed fix
-from typing import Any, Callable
+from collections.abc import Callable
+from typing import Any
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/stt_evaluations/metrics.py` at line 14, The import of
Callable in metrics.py should come from collections.abc for Python 3.9+
compatibility: update the imports so Callable is imported from collections.abc
(keep Any from typing if still used) and remove Callable from typing; ensure any
references to Callable in functions or type hints (e.g., in this module's
metrics-related functions/classes) remain unchanged.
backend/app/services/stt_evaluations/metric_job.py (2)

128-140: Consider wrapping updates in a single transaction.

Currently, line 131 commits the bulk result updates, and lines 136-140 then update the run separately. If the run update fails, individual result scores will be committed but run-level aggregates won't be stored. This inconsistency may be acceptable, but wrapping both in a single commit would ensure atomicity.

💡 Optional: Move commit after aggregate update
             # Bulk update all result scores
             if score_updates:
                 session.bulk_update_mappings(STTResult, score_updates)
-                session.commit()

             # Compute and store run-level aggregate scores
             if all_scores:
                 aggregate = compute_run_aggregate_scores(all_scores)
                 update_stt_run(
                     session=session,
                     run_id=run_id,
                     score=aggregate,
                 )
+
+            # Commit all updates together
+            if score_updates or all_scores:
+                session.commit()

Note: This would require modifying update_stt_run to not commit internally, which may have broader implications.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/stt_evaluations/metric_job.py` around lines 128 - 140,
The bulk result updates (session.bulk_update_mappings on STTResult) are
committed separately from the run-level update (update_stt_run), risking partial
commits; wrap both the bulk updates and the compute_run_aggregate_scores ->
update_stt_run call in a single transaction so they commit atomically (move
session.commit to after update_stt_run), and adjust update_stt_run to accept/use
the provided session without committing internally so the outer transaction
controls the commit for run_id updates.

22-30: Consider documenting the unused parameters.

The static analysis flags job_id, task_instance, organization_id, and kwargs as unused. However, these are required by the Celery job execution framework (as shown in _execute_job_internal, which passes them). Consider adding a brief comment or using the _ prefix convention for the truly unused ones, while keeping task_instance available for potential progress reporting.

💡 Optional: Add underscore prefix or comment
 def execute_metric_computation(
     project_id: int,
-    job_id: str,
+    job_id: str,  # noqa: ARG001 - Required by Celery job framework
     task_id: str,
-    task_instance: Any,
-    organization_id: int,
+    task_instance: Any,  # noqa: ARG001 - Available for progress updates
+    organization_id: int,  # noqa: ARG001 - Required by Celery job framework
     run_id: int,
-    **kwargs: Any,
+    **kwargs: Any,  # noqa: ARG001 - Reserved for future extensibility
 ) -> dict[str, Any]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/stt_evaluations/metric_job.py` around lines 22 - 30, The
parameters job_id, organization_id, and kwargs are flagged as unused but are
required by the Celery executor; rename them with a leading underscore (e.g.,
_job_id, _organization_id, **_kwargs) in the execute_metric_computation
signature to indicate they are intentionally unused, keep task_instance named
(and add a brief comment above the function like "# task_instance kept for
possible progress reporting; other params required by Celery") so
reviewers/static analysis know why they exist.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 79c6af55-2862-419a-b7ba-fafbfa6f26d6

📥 Commits

Reviewing files that changed from the base of the PR and between ab46e3d and 84121d6.

⛔ Files ignored due to path filters (1)
  • backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • backend/app/crud/stt_evaluations/cron.py
  • backend/app/services/stt_evaluations/metric_job.py
  • backend/app/services/stt_evaluations/metrics.py
  • backend/pyproject.toml


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
backend/app/services/stt_evaluations/metric_job.py (1)

129-141: Non-atomic commits between result scores and run aggregate.

The result scores are committed at line 132, then the run aggregate is updated separately at lines 137-141. If the aggregate update fails after scores are committed, the run will lack aggregate scores while individual results have their scores set.

This is likely acceptable for a background job (re-running would be idempotent), but consider moving the commit after the aggregate update for atomicity:

♻️ Optional refactor for atomicity
             # Bulk update all result scores
             if score_updates:
                 session.execute(update(STTResult), score_updates)
-                session.commit()

             # Compute and store run-level aggregate scores
             if all_scores:
                 aggregate = compute_run_aggregate_scores(all_scores)
                 update_stt_run(
                     session=session,
                     run_id=run_id,
                     score=aggregate,
                 )
+
+            # Commit all changes atomically
+            if score_updates or all_scores:
+                session.commit()

Note: This would require modifying update_stt_run to not commit internally, which may have wider implications.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/stt_evaluations/metric_job.py` around lines 129 - 141,
The commit of individual STTResult scores is done before computing/updating the
run aggregate, risking partial state if update_stt_run fails; instead, apply the
bulk update (session.execute(update(STTResult), score_updates)) but defer
session.commit() until after compute_run_aggregate_scores(all_scores) and
calling update_stt_run(...). Ensure update_stt_run accepts the existing session
and does not commit internally (or add a new parameter to disable its internal
commit) so the single session.commit() after both updates makes the operation
atomic.
backend/app/services/stt_evaluations/metrics.py (1)

14-14: Import Callable from collections.abc instead of typing.

Per static analysis and Python best practices, Callable should be imported from collections.abc rather than typing.

♻️ Suggested fix
-from typing import Any, Callable
+from collections.abc import Callable
+from typing import Any
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/stt_evaluations/metrics.py` at line 14, Replace the
current import of Callable from typing with the collections.abc variant to
ensure Python 3.9+ compatibility: change the import statement that currently
reads "from typing import Any, Callable" to import Any from typing (or typing as
needed) and import Callable from collections.abc; update the import line(s) in
metrics.py where "Callable" is referenced so existing type hints (e.g., any
functions or variables using Callable) continue to resolve correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 87aa10fd-e77b-42c3-8b95-60b76ec30adf

📥 Commits

Reviewing files that changed from the base of the PR and between 84121d6 and f18d1ce.

📒 Files selected for processing (2)
  • backend/app/services/stt_evaluations/metric_job.py
  • backend/app/services/stt_evaluations/metrics.py

@AkhileshNegi AkhileshNegi requested a review from vprashrex March 6, 2026 06:29

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
backend/app/tests/services/stt_evaluations/test_metric_job.py (2)

48-55: Return fresh kwargs per test instead of sharing a module-level mock.

BASE_KWARGS embeds a single MagicMock() instance for task_instance. If execute_metric_computation() touches that object, state and call history can leak across tests. A small factory keeps each case isolated.

Proposed refactor
-BASE_KWARGS: dict[str, Any] = {
-    "project_id": 1,
-    "job_id": "10",
-    "task_id": "celery-task-123",
-    "task_instance": MagicMock(),
-    "organization_id": 1,
-    "run_id": 10,
-}
+def _make_base_kwargs() -> dict[str, Any]:
+    return {
+        "project_id": 1,
+        "job_id": "10",
+        "task_id": "celery-task-123",
+        "task_instance": MagicMock(),
+        "organization_id": 1,
+        "run_id": 10,
+    }

Then switch call sites to execute_metric_computation(**_make_base_kwargs()). As per coding guidelines, backend/app/tests/**/*.py: "Use factory pattern for test fixtures in backend/app/tests/".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/services/stt_evaluations/test_metric_job.py` around lines
48 - 55, BASE_KWARGS currently holds a shared MagicMock() under "task_instance"
which can leak state between tests; replace this pattern by creating a factory
function (e.g., _make_base_kwargs) that returns a fresh dict with a new
MagicMock() for "task_instance" each call, update all test call sites to use
execute_metric_computation(**_make_base_kwargs()) instead of BASE_KWARGS, and
remove or stop using the module-level BASE_KWARGS to ensure isolation in tests
that call execute_metric_computation.

65-67: Add type hints to the injected patch mocks.

These test methods annotate -> None but leave the injected mock parameters untyped. Annotating them as MagicMock keeps the file aligned with the repo's typing rule.

Example
     def test_no_results_returns_zero_counts(
-        self, mock_session_cls, mock_now, mock_calc, mock_update_run
+        self,
+        mock_session_cls: MagicMock,
+        mock_now: MagicMock,
+        mock_calc: MagicMock,
+        mock_update_run: MagicMock,
     ) -> None:

Apply the same pattern to the other patched test methods in this class. As per coding guidelines, "Always add type hints to all function parameters and return values in Python code" and "Use Python 3.11+ with type hints throughout the codebase".

Also applies to: 79-81, 120-122, 141-143, 163-165, 189-191, 223-225

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/services/stt_evaluations/test_metric_job.py` around lines
65 - 67, Update the test method signatures to add explicit type hints for the
injected patch mocks (use unittest.mock.MagicMock) so parameters like
mock_session_cls, mock_now, mock_calc, mock_update_run (and the other patched
params in methods test_no_results_returns_zero_counts, plus the groups starting
at lines referenced: the methods around lines 79-81, 120-122, 141-143, 163-165,
189-191, 223-225) are annotated as MagicMock; import MagicMock from
unittest.mock if not present and apply the same pattern to all patched test
methods in this test class so every injected mock parameter has a MagicMock
type hint and return annotations remain -> None.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9b874bfe-9479-44f3-88ec-efe4c31beac0

📥 Commits

Reviewing files that changed from the base of the PR and between f18d1ce and 52e5444.

📒 Files selected for processing (2)
  • backend/app/tests/services/stt_evaluations/test_metric_job.py
  • backend/app/tests/services/stt_evaluations/test_metrics.py

@AkhileshNegi AkhileshNegi requested a review from Prajna1999 March 6, 2026 07:08
@AkhileshNegi AkhileshNegi merged commit 20bacbd into main Mar 6, 2026
3 checks passed
@AkhileshNegi AkhileshNegi deleted the enhancement/stt-automated-metric branch March 6, 2026 10:26
nishika26 pushed a commit that referenced this pull request Mar 10, 2026

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

Evaluation: Automated metrics TTS/STT evals
