
Parallelize multi-chunk inference for ~3-4x speedup on multi-sentence text #147

Open
namanomar wants to merge 1 commit into KittenML:main from namanomar:perf/onnx-inference-speedup

@namanomar


Summary

generate() previously processed each text chunk (from chunk_text()) strictly sequentially: phonemize, then run ONNX inference, one chunk at a time. For multi-sentence input (the common real-world case, not just the README's one-liner demo), this means N sequential session.run() calls even though they're independent.

This PR splits the two steps:

  • Phonemization stays sequential. eSpeak/phonemizer keeps shared internal state and is not safe to call concurrently. I confirmed this by calling it from multiple threads, which reproduced "RuntimeError: number of lines in input and output must be equal".
  • ONNX inference now runs concurrently across chunks via a ThreadPoolExecutor. ONNX Runtime sessions support concurrent run() calls on the same session, and profiling showed that inference accounts for >99.9% of per-chunk latency (phonemization plus tokenization takes under 2 ms).

Single-chunk text (the common short-sentence case) is unaffected — it skips the executor entirely and behaves exactly as before.
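
For readers who want the shape of the change without opening the diff, here is a minimal sketch of the split described above. It is not the code in kittentts/onnx_model.py: `generate_chunks`, `phonemize`, `infer`, and `max_workers` are hypothetical names, and `infer` stands in for whatever wraps session.run() on the shared session.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

Tokens = TypeVar("Tokens")
Audio = TypeVar("Audio")


def generate_chunks(
    chunks: Sequence[str],
    phonemize: Callable[[str], Tokens],
    infer: Callable[[Tokens], Audio],
    max_workers: int = 4,  # hypothetical default, not taken from the PR
) -> List[Audio]:
    """Phonemize sequentially, then run inference concurrently, keeping order."""
    # Step 1: phonemization stays on the calling thread, one chunk at a time,
    # because the phonemizer is not safe to call concurrently.
    tokenized = [phonemize(chunk) for chunk in chunks]

    # Single-chunk (or empty) input: skip the executor entirely.
    if len(tokenized) <= 1:
        return [infer(tokens) for tokens in tokenized]

    # Step 2: the inference calls are independent, so dispatch them to a pool.
    # Executor.map yields results in input order, so the audio stays in sequence.
    workers = min(max_workers, len(tokenized))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(infer, tokenized))
```

With stand-in callables, `generate_chunks(["Hello.", "World again."], str.lower, len)` exercises the pooled path, while a one-element list takes the direct path. Using `map` rather than collecting futures as they complete is what keeps the output order tied to the chunk order.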

Benchmarks

On kitten-tts-mini-0.8, a 5-sentence paragraph, 12-core CPU:

  • Sequential (current behavior): ~37-38s
  • Parallel (this PR): ~8-9s (~4x speedup)

I also benchmarked manual SessionOptions tuning (explicit thread counts, ORT_ENABLE_EXTENDED, ORT_PARALLEL execution mode) as an alternative approach — every manual override I tried was slower than ORT's auto-tuned defaults (e.g. forcing all 12 threads was 2.5x slower than the default). So this PR doesn't touch SessionOptions at all; it only changes how chunks are dispatched.
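
For reference, the kind of manual override that was benchmarked and rejected looks roughly like the sketch below. The helper name `build_manual_session_options` is hypothetical, and mapping "explicit thread counts" to `intra_op_num_threads` is an assumption; the attributes themselves are standard ONNX Runtime `SessionOptions` fields.

```python
def build_manual_session_options(num_threads: int = 12):
    """Return SessionOptions with manual overrides of the kind described above."""
    # Imported lazily so this sketch can be loaded without onnxruntime installed.
    import onnxruntime as ort

    opts = ort.SessionOptions()
    # Assumption: the explicit thread count was set via the intra-op pool size.
    opts.intra_op_num_threads = num_threads
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
    opts.execution_mode = ort.ExecutionMode.ORT_PARALLEL
    return opts
```

Such an object would be passed as `sess_options=` when constructing an `onnxruntime.InferenceSession`. This PR does none of that and leaves the session on ORT's defaults.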

Verification

  • Existing test suite passes (python -m unittest discover -s tests).
  • No NaN/Inf, no clipping, stable RMS across 5 repeated runs of the same multi-chunk input.
  • Verified across multiple voices (Bella, Hugo, Kiki, Jasper) and longer inputs (up to 15 chunks).
  • Note: the model itself has inherent stochastic sampling (two purely sequential calls on identical input already produce different output, e.g. max abs diff ~0.2-0.3) — this is pre-existing behavior on main, unrelated to this change, and confirmed by testing main directly before making any modifications.
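
The checks in the list above can be expressed as a small helper. This is a sketch of one way to compute them, not the script used for this PR: `audio_health` and `max_abs_diff` are hypothetical names, and it assumes float samples normalized to [-1, 1], with "clipping" meaning a peak at or above `clip_level`.

```python
import math
from typing import Dict, Sequence


def audio_health(samples: Sequence[float], clip_level: float = 1.0) -> Dict[str, object]:
    """Report finiteness, clipping, peak, and RMS for one waveform."""
    finite = all(math.isfinite(s) for s in samples)
    if not finite or len(samples) == 0:
        # Peak and RMS are meaningless for empty or NaN/Inf-contaminated audio.
        return {"finite": finite, "clipped": False,
                "peak": float("nan"), "rms": float("nan")}
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return {"finite": True, "clipped": peak >= clip_level, "peak": peak, "rms": rms}


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest sample-wise difference over the shared prefix of two runs."""
    # zip stops at the shorter input (assumption: lengths may differ per run).
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)
```

Because the model samples stochastically, `max_abs_diff` between two runs is expected to be nonzero even on main, so the RMS comparison across repeated runs is the more useful stability signal.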

Scope

Only kittentts/onnx_model.py is touched. generate_stream() is left untouched/sequential on purpose, since streaming cares about low time-to-first-chunk rather than total throughput, and parallelizing it would change that latency characteristic.
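
To illustrate why the streaming path is better left sequential, here is a hedged sketch of a chunk-at-a-time generator. It is not the actual generate_stream() implementation; `stream_chunks`, `phonemize`, and `infer` are hypothetical stand-ins.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Tokens = TypeVar("Tokens")
Audio = TypeVar("Audio")


def stream_chunks(
    chunks: Iterable[str],
    phonemize: Callable[[str], Tokens],
    infer: Callable[[Tokens], Audio],
) -> Iterator[Audio]:
    """Yield audio for each chunk as soon as that chunk alone is done."""
    for chunk in chunks:
        # No work is done for later chunks until the consumer asks for them,
        # so the first chunk never competes with the others for CPU time.
        yield infer(phonemize(chunk))
```

In this shape the first result is available after exactly one inference call. Running all chunks concurrently would share the CPU among them and could delay that first result, which is the latency characteristic the PR chooses to preserve.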

… text

generate() previously ran each text chunk's phonemization and ONNX
inference strictly sequentially. Phonemization (espeak) keeps shared
internal state and isn't safe to call concurrently, but ONNX Runtime
sessions support concurrent run() calls. Splitting these two steps lets
inference across chunks run in a thread pool while keeping phonemization
sequential, since it's the inference step that dominates latency (>99.9%
of per-chunk time per profiling).

Benchmarked on a 5-sentence paragraph (kitten-tts-mini-0.8, 12-core CPU):
~37s sequential -> ~8-9s parallel. Single-chunk text is unaffected (no
executor overhead, same as before). Verified no NaN/Inf, no clipping,
and stable output across repeated runs, multiple voices, and longer
multi-chunk inputs (up to 15 chunks).
