Deterministic, file-first audiobook production pipeline: normalize → cue → synthesize. Every stage is an explicit CLI command that reads and writes well-defined artifacts so engineers, directors, and narrators can audit, edit, and rerun any step without hidden state.
parts/ → normalize/ → cues/ → audio/ (→ finalize/)
- Python 3.11 (ships with the provided PyTorch CUDA image).
- `uv` package manager (already available in the container).
- OpenAI project/API key exported as `OPENAI_API_KEY` (used for normalization + cueing via `gpt-5-mini-2025-08-07`).
- ffmpeg on the system path (needed by `pydub` to mux WAVs).
- Optional GPU (8 GiB VRAM works); the toolchain automatically falls back to CPU if chatterbox-tts cannot load on CUDA.
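For a quick sanity check of these prerequisites inside the container, a short script along these lines works (this helper is not part of the toolchain; it just mirrors the list above):

```python
import os
import shutil

import torch

# OPENAI_API_KEY is needed by the normalize and cue stages.
print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))

# ffmpeg must be on PATH so pydub can mux WAVs.
print("ffmpeg found:", shutil.which("ffmpeg") is not None)

# A GPU is optional; synthesis falls back to CPU when CUDA is unavailable.
print("torch", torch.__version__, "cuda available:", torch.cuda.is_available())
```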
# From the repository root
uv pip install -e .
# (Optional) confirm Torch + CUDA availability
python - <<'PY'
import torch
print('torch', torch.__version__, 'cuda', torch.cuda.is_available())
PY

The project depends on torchaudio; install TorchCodec if your environment lacks built-in encoder support.
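If you are unsure whether your torchaudio build can encode WAVs without TorchCodec, a round-trip probe like the following is a quick check (file name and sample rate are arbitrary choices for this sketch):

```python
import math

import torch
import torchaudio

sample_rate = 24_000
t = torch.linspace(0.0, 1.0, sample_rate)
tone = torch.sin(2 * math.pi * 440.0 * t).unsqueeze(0)  # shape [1, frames]

# Encoding fails here if no WAV encoder backend is available.
torchaudio.save("encoder_check.wav", tone, sample_rate)
wav, sr = torchaudio.load("encoder_check.wav")
print("round-trip ok:", wav.shape, sr)
```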
Workspaces live under $WORKSPACE_DIR/tts/<text_name>/ (defaults to /data/workspace). Running the pipeline on republic.txt produces:
/data/workspace/tts/republic/
manifest.json
parts/ # text fragments emitted by partition_text()
normalize/ # NormalizedPart XML (one per part)
cues/ # CuedScript XML
audio/ # <text>_pNNN_####_<speaker>.wav + .json metadata
Voice references are resolved from $WORKSPACE_DIR/voices/ (e.g., voices/enoch.wav). Provide speaker:voice pairs through the CLI when synthesizing.
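The audio file names encode their provenance. Assuming `pNNN` is a zero-padded part index and `####` a zero-padded chunk index (the listing above does not spell this out), they can be unpacked like so:

```python
import re
from pathlib import Path

# Assumed reading of the naming convention above: <text>_pNNN_####_<speaker>.wav
AUDIO_NAME = re.compile(
    r"^(?P<text>.+)_p(?P<part>\d+)_(?P<chunk>\d+)_(?P<speaker>[^_]+)\.wav$"
)

def describe(path: Path) -> dict[str, str] | None:
    match = AUDIO_NAME.match(path.name)
    return match.groupdict() if match else None

# Hypothetical file name, for illustration only.
print(describe(Path("republic_p001_0001_default.wav")))
```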
- Place your EPUB, PDF, or plain-text source in `/data/workspace/in/`.
- Export your OpenAI key: `export OPENAI_API_KEY=...`
- Run the toolchain end-to-end:

  python -m laban_tts.workflow run \
    --text_file /data/workspace/in/republic.txt \
    --auto \
    --voice_files default:enoch \
    --prepare_conditionals=true

- Inspect artifacts between stages (XML, JSON, WAV); a small inspection helper is sketched after this list. Re-run individual stages with `--force` if you edit upstream files.
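When inspecting artifacts between stages, a small helper like this (not part of the CLI; directory names follow the workspace layout above) gives a quick overview of what each stage has produced so far:

```python
from pathlib import Path

workspace = Path("/data/workspace/tts/republic")

# Count the artifacts each stage has written so far.
for stage, pattern in [
    ("parts", "*.xml"),
    ("normalize", "*.xml"),
    ("cues", "*.xml"),
    ("audio", "*.wav"),
]:
    files = sorted((workspace / stage).glob(pattern))
    print(f"{stage:10s} {len(files):4d} file(s)")
```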
| Stage | Command | Inputs | Outputs | Notes |
|---|---|---|---|---|
| parts | `run` (without `--auto`) | Source text | `parts/*.xml` | Partitioned slices sized for LLM context. |
| normalize | `normalize <workspace>` | `parts/*.xml` | `normalize/*-normalized.xml` | Cleans text, classifies speakers + discourse; skips existing files unless `--force`. |
| cue | `cue <workspace>` | `normalize/*.xml` | `cues/*-cues.xml` | LLM produces chunk, rhetoric, profile, and emphasis metadata; reuses existing scripts unless `--force`. |
| synthesize | `synthesize <workspace>` | `cues/*.xml`, voice WAVs | `audio/*.wav` + `.json` | Uses chatterbox-tts; reuses existing audio unless `--force`; handles GPU OOM with CPU fallback. |
| finalize | `finalize <workspace>` | `audio/*.wav` | Logs inventory | Placeholder for later concatenation/mixing. |
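The "skips/reuses existing outputs unless `--force`" behaviour noted in the table is what makes cheap re-runs possible. The project's own implementation is not shown here; the idea is roughly this (a minimal sketch, not the actual code, and the cue file name below is hypothetical):

```python
from pathlib import Path

def should_write(output: Path, force: bool) -> bool:
    """Return True when a stage should (re)generate this artifact."""
    if force:
        return True
    # Existing artifacts are treated as authoritative and left untouched,
    # so manual edits survive later pipeline runs.
    return not output.exists()

# Only regenerate a cue script that is missing (or when --force is given).
print(should_write(Path("cues/republic-part007-cues.xml"), force=False))
```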
Invoke stages individually:
python -m laban_tts.workflow normalize /data/workspace/tts/republic
python -m laban_tts.workflow cue /data/workspace/tts/republic
python -m laban_tts.workflow synthesize /data/workspace/tts/republic --voice_files default:enoch

Need to tweak a section?
- Edit `parts/republic-part007.xml` directly.
- Re-run `normalize` (only that part is regenerated).
- Re-run `cue` and `synthesize`; downstream files for other parts stay untouched (a staleness-check sketch follows this list).
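To see which downstream artifacts an edit has invalidated before re-running anything, a modification-time comparison is enough. Derived-file names here are assumed from the workspace listing above; this is a convenience sketch, not project logic:

```python
from pathlib import Path

workspace = Path("/data/workspace/tts/republic")
edited = workspace / "parts" / "republic-part007.xml"

# List derived artifacts for this part that are older than the edit,
# i.e. candidates for regeneration on the next normalize/cue run.
for downstream in sorted(workspace.glob("*/*part007*")):
    if downstream != edited and downstream.stat().st_mtime < edited.stat().st_mtime:
        print("stale:", downstream.relative_to(workspace))
```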
Use `--force` to allow stages to overwrite existing directories.
- Supply a comma-separated list: `default:enoch,glaucon:enoch` (see the resolution sketch below). Speaker identifiers are lowercase.
- Per-chunk overrides in `CuedScript` (the `audio-prompt` attribute) take precedence over CLI flags.
- `--prepare_conditionals` primes chatterbox-tts with the default voice when supported.
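To make the mapping concrete, this is roughly how a `--voice_files` value resolves to reference WAVs under `$WORKSPACE_DIR/voices/`, following the documented convention (a sketch, not the project's actual resolver):

```python
import os
from pathlib import Path

def resolve_voices(spec: str) -> dict[str, Path]:
    """Map lowercase speaker ids to reference WAVs, e.g. 'default:enoch,glaucon:enoch'."""
    voices_dir = Path(os.environ.get("WORKSPACE_DIR", "/data/workspace")) / "voices"
    mapping: dict[str, Path] = {}
    for pair in spec.split(","):
        speaker, voice = pair.split(":")
        wav = voices_dir / f"{voice.strip()}.wav"
        if not wav.exists():
            raise FileNotFoundError(f"missing voice reference: {wav}")
        mapping[speaker.strip().lower()] = wav
    return mapping

# print(resolve_voices("default:enoch,glaucon:enoch"))
```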
Project policy requires:
black --check .
pyright
pytest -q

Integration smoke test (replace the OpenAI key with your own):
export OPENAI_API_KEY=...
python -m laban_tts.workflow run \
--text_file /data/workspace/in/republic.txt \
--auto \
--voice_files default:enoch

This yields 59 WAV chunks for the sample Republic excerpt and records chunk timing metadata alongside the audio.
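To confirm the smoke-test output, count the WAVs and peek at the per-chunk metadata sidecars. The exact JSON fields are not documented here, so this just prints whatever keys are present:

```python
import json
from pathlib import Path

audio_dir = Path("/data/workspace/tts/republic/audio")

wavs = sorted(audio_dir.glob("*.wav"))
print(f"{len(wavs)} WAV chunk(s) produced")  # expected: 59 for the sample excerpt

# Each chunk has a JSON sidecar with timing/metadata; show the first one's keys.
metas = sorted(audio_dir.glob("*.json"))
if metas:
    print(metas[0].name, "->", sorted(json.loads(metas[0].read_text()).keys()))
```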
- CUDA out of memory: the loader automatically retries on CPU (logging a `tts.load_cuda_failed` warning); expect longer synthesis times but stable output. A fallback sketch follows this list.
- Missing voice files: ensure `<voice_name>.wav` exists under `$WORKSPACE_DIR/voices/`. Set `--voice_files default:<voice>` to avoid per-speaker resolution errors.
- ffmpeg errors: install the binary (`apt-get install ffmpeg`) so `pydub` can decode WAV buffers.
- LLM quota/timeouts: the pipeline uses structured outputs, so failing calls abort the stage; re-run once quotas recover.
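The CPU fallback mentioned above follows a common pattern: try to load the model on CUDA and reload on CPU if that fails. A minimal sketch, assuming the `ChatterboxTTS.from_pretrained(device=...)` entry point of chatterbox-tts (verify the import path against your installed version):

```python
import logging

import torch
from chatterbox.tts import ChatterboxTTS  # import path assumed; check your install

log = logging.getLogger("tts")

def load_tts():
    """Prefer CUDA, but fall back to CPU if loading fails (e.g. out of memory)."""
    if torch.cuda.is_available():
        try:
            return ChatterboxTTS.from_pretrained(device="cuda")
        except RuntimeError as exc:  # includes torch.cuda.OutOfMemoryError
            log.warning("tts.load_cuda_failed: %s; retrying on CPU", exc)
    return ChatterboxTTS.from_pretrained(device="cpu")
```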
For deeper context, the project scratchpad (scratchpad.md) logs the latest pipeline run, token usage, and validation notes.