Deterministic, file-first audiobook production pipeline: normalize → cue → synthesize. Every stage is an explicit CLI command that reads and writes well-defined artifacts so engineers, directors, and narrators can audit, edit, and rerun any step without hidden state.
parts/ → normalize/ → cues/ → audio/ (→ finalize/)
- Python 3.11 (ships with the provided PyTorch CUDA image).
- `uv` package manager (already available in the container).
- OpenAI project/API key exported as `OPENAI_API_KEY` (used for normalization + cueing via `gpt-5-mini-2025-08-07`).
- ffmpeg on the system path (needed by `pydub` to mux WAVs).
- Optional GPU (8 GiB VRAM works); the toolchain automatically falls back to CPU if chatterbox-tts cannot load on CUDA.
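For a quick sanity check of these prerequisites inside the container, a short script along these lines works (this helper is not part of the toolchain; it just mirrors the list above):

```python
import os
import shutil

import torch

# OPENAI_API_KEY is needed by the normalize and cue stages.
print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))

# ffmpeg must be on PATH so pydub can mux WAVs.
print("ffmpeg found:", shutil.which("ffmpeg") is not None)

# A GPU is optional; synthesis falls back to CPU when CUDA is unavailable.
print("torch", torch.__version__, "cuda available:", torch.cuda.is_available())
```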
# From the repository root
uv pip install -e .
# (Optional) confirm Torch + CUDA availability
python - <<'PY'
import torch
print('torch', torch.__version__, 'cuda', torch.cuda.is_available())
PY

The project depends on torchaudio; install TorchCodec if your environment lacks built-in encoder support.
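If you are unsure whether your torchaudio build can encode WAVs without TorchCodec, a round-trip probe like the following is a quick check (file name and sample rate are arbitrary choices for this sketch):

```python
import math

import torch
import torchaudio

sample_rate = 24_000
t = torch.linspace(0.0, 1.0, sample_rate)
tone = torch.sin(2 * math.pi * 440.0 * t).unsqueeze(0)  # shape [1, frames]

# Encoding fails here if no WAV encoder backend is available.
torchaudio.save("encoder_check.wav", tone, sample_rate)
wav, sr = torchaudio.load("encoder_check.wav")
print("round-trip ok:", wav.shape, sr)
```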
Workspaces live under $WORKSPACE_DIR/tts/<text_name>/ (defaults to /data/workspace). Running the pipeline on republic.txt produces:
/data/workspace/tts/republic/
manifest.json
parts/ # text fragments emitted by partition_text()
normalize/ # NormalizedPart XML (one per part)
cues/ # CuedScript XML
audio/ # <text>_pNNN_####_<speaker>.wav + .json metadata
Voice references are resolved from $WORKSPACE_DIR/voices/ (e.g., voices/enoch.wav). Provide speaker:voice pairs through the CLI when synthesizing.
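The audio file names encode their provenance. Assuming `pNNN` is a zero-padded part index and `####` a zero-padded chunk index (the listing above does not spell this out), they can be unpacked like so:

```python
import re
from pathlib import Path

# Assumed reading of the naming convention above: <text>_pNNN_####_<speaker>.wav
AUDIO_NAME = re.compile(
    r"^(?P<text>.+)_p(?P<part>\d+)_(?P<chunk>\d+)_(?P<speaker>[^_]+)\.wav$"
)

def describe(path: Path) -> dict[str, str] | None:
    match = AUDIO_NAME.match(path.name)
    return match.groupdict() if match else None

# Hypothetical file name, for illustration only.
print(describe(Path("republic_p001_0001_default.wav")))
```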
- Place your EPUB, PDF, or plain-text source in `/data/workspace/in/`.
- Export your OpenAI key: `export OPENAI_API_KEY=...`
- Run the toolchain end-to-end:

  python -m laban_tts.workflow run \
    --text_file /data/workspace/in/republic.txt \
    --auto \
    --voice_files default:enoch \
    --prepare_conditionals=true

- Inspect artifacts between stages (XML, JSON, WAV); a small inspection helper is sketched after this list. Re-run individual stages with `--force` if you edit upstream files.
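When inspecting artifacts between stages, a small helper like this (not part of the CLI; directory names follow the workspace layout above) gives a quick overview of what each stage has produced so far:

```python
from pathlib import Path

workspace = Path("/data/workspace/tts/republic")

# Count the artifacts each stage has written so far.
for stage, pattern in [
    ("parts", "*.xml"),
    ("normalize", "*.xml"),
    ("cues", "*.xml"),
    ("audio", "*.wav"),
]:
    files = sorted((workspace / stage).glob(pattern))
    print(f"{stage:10s} {len(files):4d} file(s)")
```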
| Stage | Command | Inputs | Outputs | Notes |
|---|---|---|---|---|
| parts | `run` (without `--auto`) | Source text | `parts/*.xml` | Partitioned slices sized for LLM context. |
| normalize | `normalize <workspace>` | `parts/*.xml` | `normalize/*-normalized.xml` | Cleans text, classifies speakers + discourse; skips existing files unless `--force`. |
| cue | `cue <workspace>` | `normalize/*.xml` | `cues/*-cues.xml` | LLM produces chunk, rhetoric, profile, and emphasis metadata; reuses existing scripts unless `--force`. |
| synthesize | `synthesize <workspace>` | `cues/*.xml`, voice WAVs | `audio/*.wav` + `.json` | Uses chatterbox-tts; reuses existing audio unless `--force`; handles GPU OOM with CPU fallback. |
| finalize | `finalize <workspace>` | `audio/*.wav` | Logs inventory | Placeholder for later concatenation/mixing. |
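The "skips/reuses existing outputs unless `--force`" behaviour noted in the table is what makes cheap re-runs possible. The project's own implementation is not shown here; the idea is roughly this (a minimal sketch, not the actual code, and the cue file name below is hypothetical):

```python
from pathlib import Path

def should_write(output: Path, force: bool) -> bool:
    """Return True when a stage should (re)generate this artifact."""
    if force:
        return True
    # Existing artifacts are treated as authoritative and left untouched,
    # so manual edits survive later pipeline runs.
    return not output.exists()

# Only regenerate a cue script that is missing (or when --force is given).
print(should_write(Path("cues/republic-part007-cues.xml"), force=False))
```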
Invoke stages individually:
python -m laban_tts.workflow normalize /data/workspace/tts/republic
python -m laban_tts.workflow cue /data/workspace/tts/republic
python -m laban_tts.workflow synthesize /data/workspace/tts/republic --voice_files default:enoch

Need to tweak a section?
- Edit `parts/republic-part007.xml` directly.
- Re-run `normalize` (only that part is regenerated).
- Re-run `cue` and `synthesize`; downstream files for other parts stay untouched (a staleness-check sketch follows this list).
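To see which downstream artifacts an edit has invalidated before re-running anything, a modification-time comparison is enough. Derived-file names here are assumed from the workspace listing above; this is a convenience sketch, not project logic:

```python
from pathlib import Path

workspace = Path("/data/workspace/tts/republic")
edited = workspace / "parts" / "republic-part007.xml"

# List derived artifacts for this part that are older than the edit,
# i.e. candidates for regeneration on the next normalize/cue run.
for downstream in sorted(workspace.glob("*/*part007*")):
    if downstream != edited and downstream.stat().st_mtime < edited.stat().st_mtime:
        print("stale:", downstream.relative_to(workspace))
```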
Use `--force` to allow stages to overwrite existing directories.
- Supply a comma-separated list: `default:enoch,glaucon:enoch` (see the resolution sketch below). Speaker identifiers are lowercase.
- Per-chunk overrides in `CuedScript` (the `audio-prompt` attribute) take precedence over CLI flags.
- `--prepare_conditionals` primes chatterbox-tts with the default voice when supported.
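To make the mapping concrete, this is roughly how a `--voice_files` value resolves to reference WAVs under `$WORKSPACE_DIR/voices/`, following the documented convention (a sketch, not the project's actual resolver):

```python
import os
from pathlib import Path

def resolve_voices(spec: str) -> dict[str, Path]:
    """Map lowercase speaker ids to reference WAVs, e.g. 'default:enoch,glaucon:enoch'."""
    voices_dir = Path(os.environ.get("WORKSPACE_DIR", "/data/workspace")) / "voices"
    mapping: dict[str, Path] = {}
    for pair in spec.split(","):
        speaker, voice = pair.split(":")
        wav = voices_dir / f"{voice.strip()}.wav"
        if not wav.exists():
            raise FileNotFoundError(f"missing voice reference: {wav}")
        mapping[speaker.strip().lower()] = wav
    return mapping

# print(resolve_voices("default:enoch,glaucon:enoch"))
```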
Project policy requires:
black --check .
pyright
pytest -q

Integration smoke test (replace the OpenAI key with your own):
export OPENAI_API_KEY=...
python -m laban_tts.workflow run \
--text_file /data/workspace/in/republic.txt \
--auto \
--voice_files default:enoch

This yields 59 WAV chunks for the sample Republic excerpt and records chunk timing metadata alongside the audio.
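To confirm the smoke-test output, count the WAVs and peek at the per-chunk metadata sidecars. The exact JSON fields are not documented here, so this just prints whatever keys are present:

```python
import json
from pathlib import Path

audio_dir = Path("/data/workspace/tts/republic/audio")

wavs = sorted(audio_dir.glob("*.wav"))
print(f"{len(wavs)} WAV chunk(s) produced")  # expected: 59 for the sample excerpt

# Each chunk has a JSON sidecar with timing/metadata; show the first one's keys.
metas = sorted(audio_dir.glob("*.json"))
if metas:
    print(metas[0].name, "->", sorted(json.loads(metas[0].read_text()).keys()))
```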
- CUDA out of memory: the loader automatically retries on CPU (logging a `tts.load_cuda_failed` warning); expect longer synthesis times but stable output. A fallback sketch follows this list.
- Missing voice files: ensure `<voice_name>.wav` exists under `$WORKSPACE_DIR/voices/`. Set `--voice_files default:<voice>` to avoid per-speaker resolution errors.
- ffmpeg errors: install the binary (`apt-get install ffmpeg`) so `pydub` can decode WAV buffers.
- LLM quota/timeouts: the pipeline uses structured outputs, so failing calls abort the stage; re-run once quotas recover.
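The CPU fallback mentioned above follows a common pattern: try to load the model on CUDA and reload on CPU if that fails. A minimal sketch, assuming the `ChatterboxTTS.from_pretrained(device=...)` entry point of chatterbox-tts (verify the import path against your installed version):

```python
import logging

import torch
from chatterbox.tts import ChatterboxTTS  # import path assumed; check your install

log = logging.getLogger("tts")

def load_tts():
    """Prefer CUDA, but fall back to CPU if loading fails (e.g. out of memory)."""
    if torch.cuda.is_available():
        try:
            return ChatterboxTTS.from_pretrained(device="cuda")
        except RuntimeError as exc:  # includes torch.cuda.OutOfMemoryError
            log.warning("tts.load_cuda_failed: %s; retrying on CPU", exc)
    return ChatterboxTTS.from_pretrained(device="cpu")
```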
For deeper context, the project scratchpad (scratchpad.md) logs the latest pipeline run, token usage, and validation notes.