A lightweight codec language model for low-resource and on-device text-to-speech.
LiteCodecLM adapts the VALL-E codec LM framework into a smaller, fine-tuning-friendly architecture. It uses EnCodec audio tokenization and a two-stage AR+NAR transformer to generate speech from phoneme sequences. LoRA support means you can adapt a base model to a new speaker or language with minimal compute.
This is a research project — I'm using it to experiment with codec LM training dynamics, especially under low-resource conditions (think: <50h of target speaker data, or low-resource languages like Cantonese).
```
Text ──► Phonemizer ──► Text Tokens
                             │
                             ▼
                     ┌─────────────┐
                     │  AR Decoder │
                     └─────────────┘
                             │  Level-0 tokens
                             ▼
                     ┌──────────────┐
                     │  NAR Decoder │
                     └──────────────┘
                             │  All codes (n_q, T)
                             ▼
                      EnCodec Decode
                             │
                             ▼
                        Waveform 🔉
```
- AR Decoder: Causal transformer, autoregressively generates level-0 codec tokens from text + speaker prompt
- NAR Decoder: Bidirectional transformer, predicts levels 1–7 in parallel given level 0
- LoRA: Optional low-rank adaptation for parameter-efficient fine-tuning
- Tokenizer: Supports English (espeak), Mandarin (pypinyin), and experimental Cantonese (jyutping)
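The two-stage flow above can be sketched in a few lines. This is a shape-only toy (random numbers stand in for the transformers' logits; `ar_decode`/`nar_decode` are hypothetical names, not the project's actual API):

```python
import random

N_Q = 8       # EnCodec quantizer levels
VOCAB = 1024  # codec codebook size

def ar_decode(text_tokens, max_steps=20):
    """Stage 1: causal AR decoder emits level-0 codec tokens one at a time."""
    codes = []
    for _ in range(max_steps):
        # real model: logits = ar_transformer(text_tokens, codes); sample next
        codes.append(random.randrange(1, VOCAB))
    return codes

def nar_decode(text_tokens, level0):
    """Stage 2: bidirectional NAR decoder fills levels 1..7, one parallel
    pass per level, conditioned on the level-0 sequence."""
    T = len(level0)
    all_codes = [level0]
    for _level in range(1, N_Q):
        # real model: one forward pass predicts the whole level at once
        all_codes.append([random.randrange(VOCAB) for _ in range(T)])
    return all_codes  # shape (n_q, T), ready for EnCodec decoding
```

The split matters for speed: only the level-0 pass is sequential; the remaining seven levels cost one forward pass each regardless of sequence length.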
Two model sizes:
| Config | AR params | NAR params | Total | VRAM (bs=8) |
|---|---|---|---|---|
| base | ~370M | ~370M | ~740M | ~40GB |
| lite | ~95M | ~95M | ~190M | ~12GB |
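LoRA is what keeps fine-tuning cheap at these sizes: the pretrained weights stay frozen and only two small low-rank matrices per layer are trained. A minimal sketch of the idea (the `LoRALinear` class is illustrative, not this repo's module):

```python
import numpy as np

class LoRALinear:
    """y = W x + (alpha/r) * B A x, with W frozen and only A, B trained."""

    def __init__(self, d_in, d_out, r=8, alpha=16):
        rng = np.random.default_rng(0)
        self.W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
        self.A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small init
        self.B = np.zeros((d_out, r))                    # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # B is zero at init, so the adapter starts as an exact no-op
        return self.W @ x + self.scale * (self.B @ (self.A @ x))
```

Trainable parameters per layer drop from `d_in * d_out` to `r * (d_in + d_out)`; at r=8 on a 1024-wide layer that is roughly a 64× reduction.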
```bash
git clone https://github.com/gettempdir/LiteCodecLM
cd LiteCodecLM
pip install -e ".[en,zh,dev]"

# For Cantonese support:
pip install pycantonese
```

```bash
# Tokenize a dataset (e.g. LJSpeech)
python scripts/prepare_data.py \
    --wav_dir data/LJSpeech-1.1/wavs \
    --transcript data/LJSpeech-1.1/metadata.csv \
    --out_dir data/ljspeech_codes \
    --manifest_out data/train.jsonl
```

```bash
# Train lite model from scratch
python scripts/train.py --config configs/lite.yaml

# Fine-tune with LoRA from a pretrained checkpoint
python scripts/train.py \
    --config configs/lite.yaml \
    --checkpoint checkpoints/lite/best.pt \
    --lora
```

```bash
python scripts/synthesize.py \
    --config configs/lite.yaml \
    --checkpoint checkpoints/lite/best.pt \
    --text "The quick brown fox jumps over the lazy dog." \
    --prompt_wav data/speaker_prompt.wav \
    --out output.wav
```

Preliminary results on LJSpeech (single speaker, ~24h training data, lite config, 200 epochs):
| Metric | Value | Notes |
|---|---|---|
| UTMOS | 3.71 | vs. GT: 4.05 |
| WER (ASR) | 8.2% | Whisper-small evaluation |
| Spk. Sim. | 0.83 | Cosine similarity (WavLM-L) |
| RTF (CPU) | ~4.2× | AR is the bottleneck |
These numbers are from my early runs and will likely improve with longer training.
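For clarity on the RTF row: real-time factor here is wall-clock synthesis time divided by output audio duration, so values above 1 mean slower than real time:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF > 1 means synthesis is slower than real time."""
    return synthesis_seconds / audio_seconds

# e.g. taking 42 s to synthesize a 10 s clip gives RTF 4.2
```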
I am working on a Yue Chinese (Cantonese) version trained on Hong Kong government radio broadcasts. The jyutping tokenizer is functional but needs better syllable coverage. Stay tuned.
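One plausible way to tokenize jyutping is to split each romanized syllable into onset, rime, and tone sub-tokens, which keeps the vocabulary small. A toy sketch of that split (the `ONSETS` list and the scheme itself are illustrative; the actual tokenizer may differ):

```python
import re

# Jyutping onsets, longest first so "gw"/"kw"/"ng" match before single letters.
# Illustrative list for this sketch, not necessarily exhaustive.
ONSETS = ("gw", "kw", "ng", "b", "p", "m", "f", "d", "t", "n", "l",
          "g", "k", "h", "z", "c", "s", "j", "w")

def split_jyutping(syllable: str) -> tuple[str, str, str]:
    """Split e.g. 'gwong2' into ('gw', 'ong', '2')."""
    m = re.fullmatch(r"([a-z]+?)([1-6])", syllable)
    if not m:
        raise ValueError(f"not a jyutping syllable: {syllable!r}")
    body, tone = m.groups()
    # body != o guards vowel-less syllables like 'ng5' from losing their rime
    onset = next((o for o in ONSETS if body.startswith(o) and body != o), "")
    return onset, body[len(onset):], tone
```

Sub-syllable tokens like these share parameters across syllables that differ only in tone, which should help in the low-resource setting, though that is untested here.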
- The Cantonese tokenizer is very experimental — I haven't trained a Cantonese model yet, just making sure the pipeline doesn't crash
- RTF on CPU is slow due to autoregressive decoding; speculative decoding is on my todo list
- LoRA fine-tuning with 30min of target speaker data gives recognizable but not great voice cloning
MIT