gettempdir/pave

PAVE — Phoneme-Aware Voice Evaluator

Python 3.9+ · License: MIT

Batch TTS and voice conversion evaluation in one command.

Give PAVE a folder of synthesized audio and a manifest, and it runs a standardized set of metrics:

| Metric      | What it measures              | Notes                        |
|-------------|-------------------------------|------------------------------|
| UTMOS       | Predicted MOS (naturalness)   | WavLM-based neural predictor |
| Speaker Sim | Speaker identity preservation | WavLM cosine similarity      |
| WER         | Word error rate               | Whisper transcription        |
| Phoneme Acc | Phoneme-level accuracy        | 1 - PER                      |
| PESQ        | Signal quality (perceptual)   | Requires clean reference     |
| STOI        | Intelligibility               | Requires clean reference     |
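
The Speaker Sim row comes down to a cosine similarity between two embedding vectors. Here is a minimal sketch of that step, assuming the embeddings are already available as plain lists of floats. The function name and the toy vectors are illustrative only; they are not PAVE's API and not real WavLM output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real speaker embeddings (assumption).
synth_emb = [0.2, 0.9, 0.1]
ref_emb = [0.25, 0.8, 0.05]
print(f"speaker_sim = {cosine_similarity(synth_emb, ref_emb):.4f}")
```

Values close to 1 mean the two embeddings point in nearly the same direction, which is read as the synthesized voice matching the reference speaker.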

I built this because I kept copy-pasting metric code between projects. It's not trying to be the definitive evaluation suite — just something fast and usable for iterating on TTS experiments.

Install

pip install -e ".[signal,en]"    # with PESQ/STOI and English phonemizer
# or
pip install -e "."               # minimal

Quick Start

Prepare a manifest (or let PAVE auto-build one from a directory):

{"id": "utt0001", "text": "The quick brown fox.", "synth": "outputs/utt0001.wav", "ref": "refs/utt0001.wav"}
{"id": "utt0002", "text": "Hello world.", "synth": "outputs/utt0002.wav", "ref": "refs/utt0002.wav"}
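
If you write manifests by hand, it can help to sanity-check them before a long run. This is a rough sketch of a JSON Lines reader, not PAVE's own loader. `parse_manifest`, `load_manifest`, and the set of required keys are assumptions; in particular, treating `ref` as optional is a guess based on the fact that only some metrics need a reference.

```python
import json
from pathlib import Path

# Assumption: "ref" is optional, the other three keys are required.
REQUIRED_KEYS = {"id", "text", "synth"}


def parse_manifest(lines):
    """Parse JSON Lines text into a list of dicts, checking required keys."""
    entries = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        entries.append(entry)
    return entries


def load_manifest(path):
    """Read a manifest file from disk and validate it."""
    with Path(path).open(encoding="utf-8") as f:
        return parse_manifest(f)


sample = [
    '{"id": "utt0001", "text": "The quick brown fox.", "synth": "outputs/utt0001.wav", "ref": "refs/utt0001.wav"}',
    '{"id": "utt0002", "text": "Hello world.", "synth": "outputs/utt0002.wav", "ref": "refs/utt0002.wav"}',
]
for entry in parse_manifest(sample):
    print(entry["id"], "->", entry["synth"])
```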

Run from CLI:

pave run --manifest eval.jsonl --metrics utmos wer speaker_sim --device cuda

Or from Python:

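from pave import PAVEEvaluator
from pave.evaluator import EvalConfig

# Same settings as the CLI call above, plus an explicit ASR model size.
evaluator = PAVEEvaluator(EvalConfig(
    manifest="eval.jsonl",
    metrics={"utmos", "wer", "speaker_sim"},
    device="cuda",
    asr_model="small",  # Whisper model used for transcription
))
results = evaluator.run()
evaluator.print_summary(results)  # prints the summary table shown below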

Output:

┏━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━┳━━━━━━━━┓
┃ Metric      ┃ Mean       ┃ Std        ┃ N   ┃ Failed ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━╇━━━━━━━━┩
│ utmos       │ 3.7100     │ 0.2300     │ 100 │ 0      │
│ wer         │ 0.0820     │ 0.0410     │ 100 │ 0      │
│ speaker_sim │ 0.8300     │ 0.0610     │ 100 │ 0      │
└─────────────┴────────────┴────────────┴─────┴────────┘
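
To make the columns concrete, here is a small sketch of how per-utterance scores could be reduced to one summary row. It is not PAVE's implementation. It assumes a failed utterance is recorded as `None`, that N counts successful utterances only, and that Std is the population standard deviation; the README does not state any of these.

```python
import statistics
from typing import Optional, Sequence


def summarize(scores: Sequence[Optional[float]]) -> dict:
    """Reduce per-utterance scores to mean / std / n / failed."""
    ok = [s for s in scores if s is not None]  # drop failed utterances
    return {
        "mean": statistics.fmean(ok) if ok else float("nan"),
        # Population standard deviation (assumption; could also be sample std).
        "std": statistics.pstdev(ok) if ok else float("nan"),
        "n": len(ok),
        "failed": len(scores) - len(ok),
    }


# Made-up scores for illustration; None marks an utterance that failed.
row = summarize([3.5, 3.9, None, 3.7])
print(f"mean={row['mean']:.4f} std={row['std']:.4f} n={row['n']} failed={row['failed']}")
```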

Building a manifest automatically

from pave.utils import build_manifest

build_manifest(
    synth_dir="outputs/",
    ref_dir="refs/",
    text_file="transcripts.txt",
    output_path="eval.jsonl",
)
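
For intuition, auto-building a manifest amounts to pairing files by name. The sketch below is a stand-in under stated assumptions, not what `build_manifest` actually does: it pairs `.wav` files by filename stem, and it takes transcripts as a dict because the format of `transcripts.txt` is not described here. `pair_by_stem` and `write_jsonl` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def pair_by_stem(synth_dir, ref_dir, transcripts):
    """Build manifest entries by matching .wav files on their filename stem.

    transcripts: dict mapping utterance id -> text (assumed input shape).
    """
    refs = {p.stem: p for p in Path(ref_dir).glob("*.wav")}
    entries = []
    for synth in sorted(Path(synth_dir).glob("*.wav")):
        utt_id = synth.stem
        if utt_id not in transcripts:
            continue  # no text for this file, so skip it
        entry = {"id": utt_id, "text": transcripts[utt_id], "synth": str(synth)}
        if utt_id in refs:
            entry["ref"] = str(refs[utt_id])  # reference is optional here
        entries.append(entry)
    return entries


def write_jsonl(entries, output_path):
    """Write one JSON object per line."""
    with Path(output_path).open("w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    synth_dir, ref_dir = Path(tmp, "outputs"), Path(tmp, "refs")
    synth_dir.mkdir()
    ref_dir.mkdir()
    for name in ("utt0001.wav", "utt0002.wav"):
        (synth_dir / name).touch()
        (ref_dir / name).touch()
    texts = {"utt0001": "The quick brown fox.", "utt0002": "Hello world."}
    entries = pair_by_stem(synth_dir, ref_dir, texts)
    write_jsonl(entries, Path(tmp, "eval.jsonl"))
    print(len(entries), "entries written")
```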

Score a single file

pave score-file outputs/test.wav
# UTMOS: 3.8421

Notes

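  • PESQ and STOI compare each synthesized file against a clean reference recording of the same utterance, so they are not suitable for zero-shot TTS evaluation without ground-truth re-synthesis
  • UTMOS scores are model predictions of MOS, not listener ratings; absolute values depend on the WavLM checkpoint used, so only compare scores produced with the same checkpoint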
  • For Mandarin/Cantonese evaluation, set --language zh or --language yue (Cantonese support is experimental)
  • The phoneme_acc metric uses a Whisper → text → phonemize pipeline, so it measures articulation only indirectly: the phonemes come from the transcript, not from the audio itself. It's not a replacement for forced alignment.
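
The last note is easier to see with a toy example. The sketch below computes phoneme accuracy as 1 - PER using a standard Levenshtein distance over phoneme sequences. The tiny `TOY_G2P` lexicon and the `phonemize` helper are made up for illustration; they stand in for a real phonemizer and are not part of PAVE.

```python
from typing import List, Sequence

# Hypothetical mini lexicon standing in for a real grapheme-to-phoneme tool.
TOY_G2P = {
    "the": ["DH", "AH"],
    "quick": ["K", "W", "IH", "K"],
    "quit": ["K", "W", "IH", "T"],
}


def phonemize(text: str) -> List[str]:
    """Look up each word in the toy lexicon and concatenate its phonemes."""
    return [p for word in text.lower().split() for p in TOY_G2P[word]]


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phoneme_accuracy(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """1 - PER, where PER = edit distance / number of reference phonemes."""
    if not ref:
        raise ValueError("reference phoneme sequence is empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


target = phonemize("the quick")
# Case 1: the ASR transcript differs from the target text.
print(f"{phoneme_accuracy(target, phonemize('the quit')):.4f}")
# Case 2: the ASR transcript matches the target text, whatever the audio sounded like.
print(f"{phoneme_accuracy(target, phonemize('the quick')):.4f}")
```

In the second case the score is perfect as long as the recognizer outputs the right words, even if the audio was slurred. That is the sense in which the metric is indirect. Note also that 1 - PER can drop below zero when the hypothesis contains many insertions; whether PAVE clamps the value is not stated here.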

License

MIT
