Batch TTS and voice conversion evaluation in one command.
Give PAVE a folder of synthesized audio and a manifest, and it runs a standardized set of metrics:
| Metric | What it measures | Notes |
|---|---|---|
| UTMOS | Predicted MOS (naturalness) | WavLM-based neural predictor |
| Speaker Sim | Speaker identity preservation | WavLM cosine similarity |
| WER | Word error rate | Whisper transcription |
| Phoneme Acc | Phoneme-level accuracy | 1 - PER |
| PESQ | Signal quality (perceptual) | Requires clean reference |
| STOI | Intelligibility | Requires clean reference |
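Two of these metrics bottom out in the same primitive: WER and Phoneme Acc are both a Levenshtein distance over token sequences, normalized by reference length (with phoneme accuracy defined as 1 - PER). A minimal pure-Python sketch of that definition, for illustration only; PAVE's actual metrics wrap Whisper and a phonemizer:

```python
def error_rate(ref, hyp):
    """Levenshtein distance between token sequences / reference length.

    Over words this is WER; over phoneme symbols it is PER.
    """
    # DP table: d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# WER over words: one substitution out of four reference words
print(error_rate("the quick brown fox".split(),
                 "the quick brown box".split()))  # 0.25
```
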
I built this because I kept copy-pasting metric code between projects. It's not trying to be the definitive evaluation suite — just something fast and usable for iterating on TTS experiments.
```bash
pip install -e ".[signal,en]"   # with PESQ/STOI and English phonemizer
# or
pip install -e "."              # minimal
```

Prepare a manifest (or let PAVE auto-build one from a directory):
```jsonl
{"id": "utt0001", "text": "The quick brown fox.", "synth": "outputs/utt0001.wav", "ref": "refs/utt0001.wav"}
{"id": "utt0002", "text": "Hello world.", "synth": "outputs/utt0002.wav", "ref": "refs/utt0002.wav"}
```
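Each line is one utterance. A manifest like this can be sanity-checked with a few lines of stdlib Python; the `load_manifest` helper below is a hypothetical illustration, not part of PAVE's API:

```python
import json
from pathlib import Path

# "ref" is only needed for reference-based metrics (PESQ/STOI, speaker_sim)
REQUIRED = {"id", "text", "synth"}

def load_manifest(path):
    """Parse a JSONL manifest, raising on entries with missing fields."""
    entries = []
    for lineno, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = REQUIRED - entry.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
        entries.append(entry)
    return entries
```
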
Run from the CLI:

```bash
pave run --manifest eval.jsonl --metrics utmos wer speaker_sim --device cuda
```

Or from Python:
```python
from pave import PAVEEvaluator
from pave.evaluator import EvalConfig

evaluator = PAVEEvaluator(EvalConfig(
    manifest="eval.jsonl",
    metrics={"utmos", "wer", "speaker_sim"},
    device="cuda",
    asr_model="small",
))
results = evaluator.run()
evaluator.print_summary(results)
```

Output:
```
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━┳━━━━━━━━┓
┃ Metric      ┃       Mean ┃        Std ┃   N ┃ Failed ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━╇━━━━━━━━┩
│ utmos       │     3.7100 │     0.2300 │ 100 │      0 │
│ wer         │     0.0820 │     0.0410 │ 100 │      0 │
│ speaker_sim │     0.8300 │     0.0610 │ 100 │      0 │
└─────────────┴────────────┴────────────┴─────┴────────┘
```
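The summary columns are plain per-metric statistics over per-utterance scores. As a rough sketch of that aggregation, assuming a `metric -> list of scores` shape with `None` marking a failed utterance (this shape is an assumption for illustration, not PAVE's documented return type):

```python
from statistics import mean, stdev

def summarize(per_utterance):
    """Collapse per-utterance scores into Mean/Std/N/Failed rows.

    `per_utterance` maps metric name -> list of float-or-None,
    where None marks an utterance that failed for that metric.
    """
    rows = {}
    for metric, scores in per_utterance.items():
        ok = [s for s in scores if s is not None]
        rows[metric] = {
            "mean": mean(ok) if ok else float("nan"),
            "std": stdev(ok) if len(ok) > 1 else 0.0,
            "n": len(ok),
            "failed": len(scores) - len(ok),
        }
    return rows
```
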
To auto-build a manifest instead of writing one by hand:

```python
from pave.utils import build_manifest

build_manifest(
    synth_dir="outputs/",
    ref_dir="refs/",
    text_file="transcripts.txt",
    output_path="eval.jsonl",
)
```

You can also score a single file:

```bash
pave score-file outputs/test.wav
# UTMOS: 3.8421
```

- PESQ and STOI require a clean reference signal, so they are not suitable for zero-shot TTS evaluation without ground-truth re-synthesis.
- UTMOS scores are calibrated approximations; absolute values depend on the WavLM checkpoint used.
- For Mandarin/Cantonese evaluation, set `--language zh` or `--language yue` (Cantonese support is experimental).
- The `phoneme_acc` metric uses a Whisper → text → phonemize pipeline, so it measures articulation indirectly; it is not a replacement for forced alignment.
License: MIT