Batch TTS and voice conversion evaluation in one command.
Give PAVE a folder of synthesized audio and a manifest, and it runs a standardized set of metrics:
| Metric | What it measures | Notes |
|---|---|---|
| UTMOS | Predicted MOS (naturalness) | WavLM-based neural predictor |
| Speaker Sim | Speaker identity preservation | WavLM cosine similarity |
| WER | Word error rate | Whisper transcription |
| Phoneme Acc | Phoneme-level accuracy | 1 - PER |
| PESQ | Signal quality (perceptual) | Requires clean reference |
| STOI | Intelligibility | Requires clean reference |
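Two of these metrics bottom out in the same primitive: WER and Phoneme Acc are both a Levenshtein distance over token sequences, normalized by reference length (with phoneme accuracy defined as 1 - PER). A minimal pure-Python sketch of that definition, for illustration only; PAVE's actual metrics wrap Whisper and a phonemizer:

```python
def error_rate(ref, hyp):
    """Levenshtein distance between token sequences / reference length.

    Over words this is WER; over phoneme symbols it is PER.
    """
    # DP table: d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# WER over words: one substitution out of four reference words
print(error_rate("the quick brown fox".split(),
                 "the quick brown box".split()))  # 0.25
```
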
I built this because I kept copy-pasting metric code between projects. It's not trying to be the definitive evaluation suite — just something fast and usable for iterating on TTS experiments.
```bash
pip install -e ".[signal,en]"   # with PESQ/STOI and English phonemizer
# or
pip install -e "."              # minimal
```

Prepare a manifest (or let PAVE auto-build one from a directory):
```jsonl
{"id": "utt0001", "text": "The quick brown fox.", "synth": "outputs/utt0001.wav", "ref": "refs/utt0001.wav"}
{"id": "utt0002", "text": "Hello world.", "synth": "outputs/utt0002.wav", "ref": "refs/utt0002.wav"}
```
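Each line is one utterance. A manifest like this can be sanity-checked with a few lines of stdlib Python; the `load_manifest` helper below is a hypothetical illustration, not part of PAVE's API:

```python
import json
from pathlib import Path

# "ref" is only needed for reference-based metrics (PESQ/STOI, speaker_sim)
REQUIRED = {"id", "text", "synth"}

def load_manifest(path):
    """Parse a JSONL manifest, raising on entries with missing fields."""
    entries = []
    for lineno, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = REQUIRED - entry.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
        entries.append(entry)
    return entries
```
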
Run from the CLI:

```bash
pave run --manifest eval.jsonl --metrics utmos wer speaker_sim --device cuda
```

Or from Python:
```python
from pave import PAVEEvaluator
from pave.evaluator import EvalConfig

evaluator = PAVEEvaluator(EvalConfig(
    manifest="eval.jsonl",
    metrics={"utmos", "wer", "speaker_sim"},
    device="cuda",
    asr_model="small",
))
results = evaluator.run()
evaluator.print_summary(results)
```

Output:
```
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━┳━━━━━━━━┓
┃ Metric      ┃       Mean ┃        Std ┃   N ┃ Failed ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━╇━━━━━━━━┩
│ utmos       │     3.7100 │     0.2300 │ 100 │      0 │
│ wer         │     0.0820 │     0.0410 │ 100 │      0 │
│ speaker_sim │     0.8300 │     0.0610 │ 100 │      0 │
└─────────────┴────────────┴────────────┴─────┴────────┘
```
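The summary columns are plain per-metric statistics over per-utterance scores. As a rough sketch of that aggregation, assuming a `metric -> list of scores` shape with `None` marking a failed utterance (this shape is an assumption for illustration, not PAVE's documented return type):

```python
from statistics import mean, stdev

def summarize(per_utterance):
    """Collapse per-utterance scores into Mean/Std/N/Failed rows.

    `per_utterance` maps metric name -> list of float-or-None,
    where None marks an utterance that failed for that metric.
    """
    rows = {}
    for metric, scores in per_utterance.items():
        ok = [s for s in scores if s is not None]
        rows[metric] = {
            "mean": mean(ok) if ok else float("nan"),
            "std": stdev(ok) if len(ok) > 1 else 0.0,
            "n": len(ok),
            "failed": len(scores) - len(ok),
        }
    return rows
```
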
To auto-build a manifest instead of writing one by hand:

```python
from pave.utils import build_manifest

build_manifest(
    synth_dir="outputs/",
    ref_dir="refs/",
    text_file="transcripts.txt",
    output_path="eval.jsonl",
)
```

You can also score a single file:

```bash
pave score-file outputs/test.wav
# UTMOS: 3.8421
```

- PESQ and STOI require a clean reference signal, so they are not suitable for zero-shot TTS evaluation without ground-truth re-synthesis.
- UTMOS scores are calibrated approximations; absolute values depend on the WavLM checkpoint used.
- For Mandarin/Cantonese evaluation, set `--language zh` or `--language yue` (Cantonese support is experimental).
- The `phoneme_acc` metric uses a Whisper → text → phonemize pipeline, so it measures articulation indirectly; it is not a replacement for forced alignment.
License: MIT