# LiteCodecLM 🔊

A lightweight codec language model for low-resource and on-device text-to-speech.

LiteCodecLM adapts the VALL-E codec LM framework into a smaller, fine-tuning-friendly architecture. It uses EnCodec audio tokenization and a two-stage autoregressive + non-autoregressive (AR + NAR) transformer to generate speech from phoneme sequences. LoRA support means you can adapt a base model to a new speaker or language with minimal compute.

This is a research project — I'm using it to experiment with codec LM training dynamics, especially under low-resource conditions (think: <50h of target speaker data, or low-resource languages like Cantonese).

## Architecture

```
Text ──► Phonemizer ──► Text Tokens ──┐
                                      ▼
                               ┌─────────────┐     Level-0 tokens
                               │ AR Decoder  │ ──────────────────►─┐
                               └─────────────┘                     │
                                                                   ▼
                                                           ┌──────────────┐
                                                           │ NAR Decoder  │ ──► All codes (n_q, T)
                                                           └──────────────┘         │
                                                                                    ▼
                                                                             EnCodec Decode
                                                                                    │
                                                                                    ▼
                                                                               Waveform 🔉
```
- **AR Decoder:** Causal transformer; autoregressively generates level-0 codec tokens from text + speaker prompt
- **NAR Decoder:** Bidirectional transformer; predicts levels 1–7 in parallel given level 0
- **LoRA:** Optional low-rank adaptation for parameter-efficient fine-tuning
- **Tokenizer:** Supports English (espeak), Mandarin (pypinyin), and experimental Cantonese (jyutping)
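The two-stage flow above can be sketched in plain Python. Random stubs stand in for the real transformers, and the token vocabulary, EOS handling, and stopping condition are all illustrative, not this repo's API:

```python
import random

random.seed(0)

N_Q = 8       # EnCodec quantizer levels (per the diagram)
VOCAB = 1024  # codec codebook size per level
EOS = VOCAB   # hypothetical end-of-sequence token for the AR stage

def ar_decoder_step(text_tokens, prefix):
    """Stand-in for the causal AR decoder: returns the next level-0 token.
    The real model conditions on text + speaker prompt + generated prefix."""
    if len(prefix) >= 20:  # toy stopping condition
        return EOS
    return random.randrange(VOCAB)

def nar_decoder(text_tokens, level0):
    """Stand-in for the bidirectional NAR decoder: predicts levels 1..7
    for every frame in parallel, conditioned on the level-0 sequence."""
    T = len(level0)
    return [[random.randrange(VOCAB) for _ in range(T)] for _ in range(N_Q - 1)]

def synthesize_codes(text_tokens):
    # Stage 1: autoregressively sample level-0 tokens until EOS
    level0 = []
    while True:
        tok = ar_decoder_step(text_tokens, level0)
        if tok == EOS:
            break
        level0.append(tok)
    # Stage 2: fill in the remaining levels in one parallel pass
    rest = nar_decoder(text_tokens, level0)
    return [level0] + rest  # shape (n_q, T), ready for EnCodec decode

codes = synthesize_codes(text_tokens=[1, 2, 3])
print(len(codes), len(codes[0]))  # n_q levels, T frames
```

The split is what makes the design fast-ish: only level 0 pays the sequential decoding cost; the other seven levels are one forward pass.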

Two model sizes:

| Config | AR params | NAR params | Total | VRAM (bs=8) |
|--------|-----------|------------|-------|-------------|
| base   | ~370M     | ~370M      | ~740M | ~40GB       |
| lite   | ~95M      | ~95M       | ~190M | ~12GB       |

## Installation

```shell
git clone https://github.com/gettempdir/LiteCodecLM
cd LiteCodecLM
pip install -e ".[en,zh,dev]"

# For Cantonese support:
pip install pycantonese
```

## Quick Start

### Prepare data

```shell
# Tokenize a dataset (e.g. LJSpeech)
python scripts/prepare_data.py \
    --wav_dir data/LJSpeech-1.1/wavs \
    --transcript data/LJSpeech-1.1/metadata.csv \
    --out_dir data/ljspeech_codes \
    --manifest_out data/train.jsonl
```
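This step stores each utterance as an `(n_q, T)` integer matrix of EnCodec codes. For intuition on why the codes have that shape, here is a toy residual vector quantization (the scheme EnCodec uses internally): each level quantizes whatever residual the previous levels left behind. The codebooks and dimensions below are made-up toy values, not EnCodec's:

```python
import numpy as np

rng = np.random.default_rng(0)

N_Q, K, D = 8, 1024, 16  # quantizer levels, codebook size, embedding dim (toy)
codebooks = rng.normal(size=(N_Q, K, D))

def rvq_encode(frames):
    """Toy residual vector quantization: level q quantizes the residual
    left by levels 0..q-1, yielding an (n_q, T) code matrix."""
    residual = frames.copy()
    codes = np.empty((N_Q, len(frames)), dtype=np.int64)
    for q in range(N_Q):
        # nearest codebook entry per frame at this level
        dists = ((residual[:, None, :] - codebooks[q][None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        codes[q] = idx
        residual = residual - codebooks[q][idx]
    return codes

frames = rng.normal(size=(75, D))  # e.g. one second at EnCodec's 75 Hz frame rate
codes = rvq_encode(frames)
print(codes.shape)  # (8, 75)
```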

### Train

```shell
# Train lite model from scratch
python scripts/train.py --config configs/lite.yaml

# Fine-tune with LoRA from a pretrained checkpoint
python scripts/train.py \
    --config configs/lite.yaml \
    --checkpoint checkpoints/lite/best.pt \
    --lora
```
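LoRA keeps the pretrained weights frozen and trains only a low-rank update alongside them. A minimal numpy sketch of the idea (the dimensions are illustrative, not this repo's actual layer shapes):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, r, alpha = 512, 8, 16  # toy dims; r << d_model is the whole point

W = rng.normal(size=(d_model, d_model))   # frozen pretrained weight
A = rng.normal(size=(r, d_model)) * 0.01  # trainable, small random init
B = np.zeros((d_model, r))                # trainable, zero init

def lora_forward(x):
    """y = x W^T + (alpha/r) x A^T B^T. Only A and B are trained, so the
    update costs 2*d_model*r params instead of d_model^2."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(1, d_model))
y = lora_forward(x)  # with B zero-initialized, this equals the frozen x @ W.T

full, lora = d_model * d_model, 2 * d_model * r
print(f"trainable params: {lora} vs {full} ({lora / full:.1%})")
# → trainable params: 8192 vs 262144 (3.1%)
```

Zero-initializing `B` means fine-tuning starts exactly at the pretrained model and drifts away only as training progresses, which is why LoRA is so stable on small speaker datasets.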

### Synthesize

```shell
python scripts/synthesize.py \
    --config configs/lite.yaml \
    --checkpoint checkpoints/lite/best.pt \
    --text "The quick brown fox jumps over the lazy dog." \
    --prompt_wav data/speaker_prompt.wav \
    --out output.wav
```

## Results

Preliminary results on LJSpeech (single speaker, ~24h training data, lite config, 200 epochs):

| Metric     | Value | Notes                       |
|------------|-------|-----------------------------|
| UTMOS      | 3.71  | vs. GT: 4.05                |
| WER (ASR)  | 8.2%  | Whisper-small evaluation    |
| Spk. Sim.  | 0.83  | Cosine similarity (WavLM-L) |
| RTF (CPU)  | ~4.2× | AR is the bottleneck        |

These numbers are from my early runs and will likely improve with longer training.
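For reference, the real-time factor (RTF) reported above is synthesis wall-clock time divided by the duration of the audio produced:

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time to synthesize / duration of the generated audio.
    RTF < 1 is faster than real time; the ~4.2x above means a 10 s
    clip takes roughly 42 s to generate on CPU."""
    return synthesis_seconds / audio_seconds

print(real_time_factor(42.0, 10.0))  # → 4.2
```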

## Cantonese Support (Experimental)

I am working on a Yue Chinese (Cantonese) version trained on HK government radio broadcasts. The jyutping tokenizer is functional but needs coverage improvements. Stay tuned.
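To illustrate the kind of tokenization involved (the actual scheme in this repo may differ): jyutping syllables decompose naturally into a segmental part and a tone digit 1–6, so a toy tokenizer might split each syllable into those two tokens. Everything below is a hypothetical sketch, not the project's tokenizer:

```python
import re

# Hypothetical scheme: "nei5 hou2" -> ["nei", "5", "hou", "2"],
# separating segments from the six lexical tones.
SYLLABLE = re.compile(r"([a-z]+)([1-6])")

def jyutping_tokens(text):
    tokens = []
    for syl in text.split():
        m = SYLLABLE.fullmatch(syl)
        if not m:
            raise ValueError(f"not a jyutping syllable: {syl!r}")
        tokens.extend(m.groups())
    return tokens

print(jyutping_tokens("nei5 hou2"))  # ['nei', '5', 'hou', '2']
```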

## Notes & Caveats

- The Cantonese tokenizer is very experimental — I haven't trained a Cantonese model yet, just making sure the pipeline doesn't crash
- RTF on CPU is slow due to autoregressive decoding; speculative decoding is on my todo list
- LoRA fine-tuning with 30min of target speaker data gives recognizable but not great voice cloning

## License

MIT
