Add compare command for transcript diff against gold-standard references by Jemoka · Pull Request #71 · TalkBank/batchalign2

Jemoka · 2026-02-27T17:57:13Z

Introduces a compare CLI command that takes main (.cha) and gold (.gold.cha)
transcripts, aligns them word-by-word using the existing Hirschberg DP algorithm
with conform/match_fn normalization, and annotates each utterance with %xsrep
(diff tokens with +/- markers) and %xsmor (POS tags) tiers. An analysis engine
computes WER metrics and writes them to a .compare.csv file.
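
As a rough illustration of the align-then-score idea described above, here is a minimal sketch. It is not the code from this PR: it uses a plain quadratic edit-distance table rather than the Hirschberg algorithm, and `conform`, `align`, and `wer` are hypothetical names with an assumed normalization (lowercasing and stripping surrounding punctuation). WER is taken as edit distance divided by the number of gold words, which is the standard definition and may differ from what the analysis engine reports.

```python
from typing import Callable, List, Tuple


def conform(word: str) -> str:
    """Hypothetical normalization: lowercase and strip surrounding punctuation."""
    return word.lower().strip(".,!?;:\"'")


def align(main: List[str], gold: List[str],
          norm: Callable[[str], str] = conform) -> Tuple[List[Tuple[str, str]], int]:
    """Align main against gold; return ((marker, word) pairs, edit distance).

    Markers: ' ' = match, '+' = word only in main, '-' = word only in gold.
    A substitution is recorded as a '+' followed by a '-'.
    """
    m, g = [norm(w) for w in main], [norm(w) for w in gold]
    n_m, n_g = len(m), len(g)
    # cost[i][j] = edit distance between main[:i] and gold[:j]
    cost = [[0] * (n_g + 1) for _ in range(n_m + 1)]
    for i in range(1, n_m + 1):
        cost[i][0] = i
    for j in range(1, n_g + 1):
        cost[0][j] = j
    for i in range(1, n_m + 1):
        for j in range(1, n_g + 1):
            sub = cost[i - 1][j - 1] + (m[i - 1] != g[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover one optimal alignment.
    out: List[Tuple[str, str]] = []
    i, j = n_m, n_g
    while i > 0 or j > 0:
        if i > 0 and j > 0 and m[i - 1] == g[j - 1] and cost[i][j] == cost[i - 1][j - 1]:
            out.append((" ", main[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
            out.append(("-", gold[j - 1]))
            out.append(("+", main[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out.append(("+", main[i - 1]))
            i -= 1
        else:
            out.append(("-", gold[j - 1]))
            j -= 1
    out.reverse()
    return out, cost[n_m][n_g]


def wer(main: List[str], gold: List[str]) -> float:
    """Word error rate: edit distance divided by the number of gold words."""
    _, distance = align(main, gold)
    return distance / len(gold) if gold else 0.0


if __name__ == "__main__":
    tokens, dist = align("the cat sat down".split(), "The cat sits.".split())
    print(" ".join((mark.strip() + word) for mark, word in tokens))
    print(dist, wer("the cat sat down".split(), "The cat sits.".split()))
```

Hirschberg's algorithm yields the same kind of alignment in linear space; the quadratic table is used here only because it is shorter to read.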

Changes:

- document.py: Add CompareToken model, Utterance.comparison field, Task.COMPARE and Task.COMPARE_ANALYSIS enum entries
- formats/chat/parser.py: Parse %xsrep/%xsmor lines into CompareToken list
- formats/chat/generator.py: Emit %xsrep/%xsmor lines from comparison data
- pipelines/analysis/compare.py: New CompareEngine (processing) and CompareAnalysisEngine (analysis) with self-contained conform/match_fn
- pipelines/dispatch.py: Register new engines in DEFAULT_PACKAGES
- cli/dispatch.py: Add compare command handling with gold file pairing
- cli/cli.py: Add compare Click command with --lang and --merge-abbrev
- cli/bench.py: Add compare to benchmark command choices
- tests: Add pytest.importorskip() for optional deps (torch, numpy, num2words, filelock, praatio) so tests skip gracefully
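
The gold file pairing mentioned for cli/dispatch.py could look roughly like the sketch below. This is an assumption-laden illustration, not the PR's code: `pair_with_gold` and `GOLD_SUFFIX` are hypothetical names, and the rule that `foo.cha` pairs with a sibling `foo.gold.cha` is inferred only from the file extensions named in the description.

```python
from pathlib import Path
from typing import List, Tuple

# Assumed naming rule, inferred from the ".cha" / ".gold.cha" extensions above.
GOLD_SUFFIX = ".gold.cha"


def pair_with_gold(directory: Path) -> List[Tuple[Path, Path]]:
    """Pair each main .cha file with a sibling .gold.cha file, if one exists."""
    pairs: List[Tuple[Path, Path]] = []
    for main in sorted(directory.glob("*.cha")):
        if main.name.endswith(GOLD_SUFFIX):
            continue  # gold files are references, not inputs to compare
        gold = main.with_name(main.stem + GOLD_SUFFIX)
        if gold.exists():
            pairs.append((main, gold))
    return pairs
```

Main files without a matching gold file are silently skipped in this sketch; whether the real command skips, warns, or fails in that case is not stated in the PR.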

https://claude.ai/code/session_01QGhmSCjjWdLn43o9PvB85b


- Parse main doc without special_mor_ so existing %mor/%gra are read into Form objects normally (Stanza overwrites, generator emits once)
- Gold still uses special_mor_=True for lenient parsing
- Track form_idx for each comparison token and interleave punctuation at original positions instead of appending all punct at end

https://claude.ai/code/session_01QGhmSCjjWdLn43o9PvB85b
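
The form_idx change can be pictured with a small sketch. `Tok` below is a hypothetical stand-in for CompareToken (its real fields are not shown in this PR), and the assumption is that every comparison token and every punctuation mark carries the index of the form it occupied in the original utterance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    """Hypothetical stand-in for CompareToken: rendered text plus original position."""
    text: str
    form_idx: int


def interleave(compared: List[Tok], punct: List[Tok]) -> List[str]:
    """Merge punctuation back in by form_idx instead of appending it at the end."""
    # sorted() is stable, so tokens sharing a form_idx keep their relative order.
    merged = sorted(compared + punct, key=lambda t: t.form_idx)
    return [t.text for t in merged]


if __name__ == "__main__":
    compared = [Tok("well", 0), Tok("I", 2), Tok("+went", 3),
                Tok("-walked", 3), Tok("home", 4)]
    punct = [Tok(",", 1), Tok(".", 5)]
    print(" ".join(interleave(compared, punct)))
    # prints: well , I +went -walked home .
```

Appending punctuation at the end would instead give `well I +went -walked home , .`, which loses the comma's original position.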

claude added 2 commits February 27, 2026 17:36

Jemoka merged commit 93a2dd7 into master Feb 27, 2026
1 of 4 checks passed

Jemoka deleted the claude/add-compare-function-61RRK branch February 27, 2026 18:44

Jemoka merged 2 commits into master from claude/add-compare-function-61RRK
