Parallelize multi-chunk inference for ~3-4x speedup on multi-sentence text #147
Open
namanomar wants to merge 1 commit into
… text

generate() previously ran each text chunk's phonemization and ONNX inference strictly sequentially. Phonemization (espeak) keeps shared internal state and isn't safe to call concurrently, but ONNX Runtime sessions support concurrent run() calls. Splitting these two steps lets inference across chunks run in a thread pool while keeping phonemization sequential, since it's the inference step that dominates latency (>99.9% of per-chunk time per profiling).

Benchmarked on a 5-sentence paragraph (kitten-tts-mini-0.8, 12-core CPU): ~37s sequential -> ~8-9s parallel. Single-chunk text is unaffected (no executor overhead, same as before).

Verified no NaN/Inf, no clipping, and stable output across repeated runs, multiple voices, and longer multi-chunk inputs (up to 15 chunks).
Summary
generate() previously processed each text chunk (from chunk_text()) strictly sequentially: phonemize, then run ONNX inference, one chunk at a time. For multi-sentence input (the common real-world case, not just the README's one-liner demo), this means N sequential session.run() calls even though they're independent.

This PR splits the two steps:

- Phonemization stays sequential. espeak keeps shared internal state and isn't safe to call concurrently; it fails with "RuntimeError: number of lines in input and output must be equal" when calling it from multiple threads.
- ONNX inference runs in a ThreadPoolExecutor, since ONNX Runtime sessions support concurrent run() calls on the same session, and profiling showed inference is >99.9% of per-chunk latency (phonemization + tokenization is sub-2ms).

Single-chunk text (the common short-sentence case) is unaffected: it skips the executor entirely and behaves exactly as before.
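The split described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: phonemize() and run_inference() are hypothetical stand-ins for the espeak-backed step and for session.run(), and the generate() signature here is an assumption made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def phonemize(chunk):
    """Hypothetical stand-in for the espeak-backed step (treated as not thread-safe)."""
    return [ord(c) for c in chunk]


def run_inference(tokens):
    """Hypothetical stand-in for session.run(); assumed safe to call concurrently."""
    return [t * 2 for t in tokens]


def generate(chunks, max_workers=None):
    # Step 1: phonemize every chunk sequentially on the calling thread,
    # so the non-thread-safe step is never entered concurrently.
    token_lists = [phonemize(chunk) for chunk in chunks]

    # Single chunk: skip the executor entirely, same path as before.
    if len(token_lists) == 1:
        return [run_inference(token_lists[0])]

    # Step 2: fan inference out across a thread pool.
    # Executor.map returns results in input order, so chunk order is preserved.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_inference, token_lists))
```

Using map rather than collecting futures with as_completed keeps the audio chunks in their original order without any extra bookkeeping.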
Benchmarks
On kitten-tts-mini-0.8, a 5-sentence paragraph, 12-core CPU: ~37s sequential -> ~8-9s parallel (figures as stated in the commit message).

I also benchmarked manual SessionOptions tuning (explicit thread counts, ORT_ENABLE_EXTENDED, ORT_PARALLEL execution mode) as an alternative approach. Every manual override I tried was slower than ORT's auto-tuned defaults (e.g. forcing all 12 threads was 2.5x slower than the default). So this PR doesn't touch SessionOptions at all; it only changes how chunks are dispatched.

Verification
- … (python -m unittest discover -s tests).
- … main, unrelated to this change, and confirmed by testing main directly before making any modifications.

Scope
Only kittentts/onnx_model.py is touched. generate_stream() is left untouched/sequential on purpose, since streaming cares about low time-to-first-chunk rather than total throughput, and parallelizing it would change that latency characteristic.
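To illustrate why a sequential stream keeps time-to-first-chunk low, here is a small sketch of a generator that yields each chunk as soon as it is ready. It is not the library's generate_stream() implementation: the function bodies, and the calls list used to observe ordering, are hypothetical and exist only for this example.

```python
calls = []  # hypothetical log of inference calls, used only to observe ordering


def phonemize(chunk):
    """Hypothetical stand-in for the phonemization step."""
    return chunk.lower()


def run_inference(phonemes):
    """Hypothetical stand-in for session.run(); records that it was invoked."""
    calls.append(phonemes)
    return f"audio({phonemes})"


def generate_stream(chunks):
    # Each chunk is phonemized and inferred only when the consumer asks for it,
    # so the first chunk is available after exactly one inference call.
    for chunk in chunks:
        yield run_inference(phonemize(chunk))
```

A consumer calling next() on the generator gets the first chunk before any work has started on the later ones, which is the latency property the PR chooses to preserve.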