Bug Description
blingfire.SentenceTokenizer — the default sentence tokenizer used by the TTS StreamAdapter — emits a trailing whitespace-only token when retain_format=True and the input ends in whitespace.
In livekit/agents/tokenize/blingfire.py::_split_sentences, the trailing text segment is appended unconditionally in retain_format mode, whereas the non-retain path strips and skips empty segments, so the two paths disagree:
    if start < len(text):
        raw_sentence = text[start:]
        if retain_format:
            merged_sentences.append((raw_sentence, start, len(text)))  # "\n\n" leaks through
        elif sentence := raw_sentence.strip():
            merged_sentences.append((sentence, start, len(text)))
Reproduction Steps
from livekit.agents.tokenize import blingfire
tok = blingfire.SentenceTokenizer(min_sentence_len=20, retain_format=True)
print(tok.tokenize("This is a real sentence to speak.\n\n"))
# ['This is a real sentence to speak.', '\n\n'] <-- trailing '\n\n' is a spurious whitespace-only token
print(blingfire.SentenceTokenizer(min_sentence_len=20).tokenize("This is a real sentence to speak.\n\n"))
# ['This is a real sentence to speak.'] <-- non-retain path correctly drops it
The streamed path (.stream()) produces the same whitespace-only token, and StreamAdapterWrapper._synthesize pushes it into the timed transcript (push_timed_transcript) unconditionally. The audio synth call itself is already guarded by a .strip() check, so the practical impact is on tokenizer-contract correctness and on clean transcript and .tokenize() output, not on empty TTS requests.
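To make the asymmetry concrete, here is a minimal stand-in for the consumer loop as described above. It is not the real StreamAdapterWrapper._synthesize: the function name feed_tokens, its callback parameters, and the order of the two calls are assumptions made for illustration only.

```python
from typing import Callable, Iterable


def feed_tokens(
    tokens: Iterable[str],
    synthesize: Callable[[str], None],
    push_transcript: Callable[[str], None],
) -> None:
    """Hypothetical model of the adapter loop, not the library's code."""
    for token in tokens:
        # The transcript push is unconditional, so a whitespace-only
        # token from the tokenizer ends up in the transcript.
        push_transcript(token)
        # The synth call is guarded, so no empty TTS request is made.
        if token.strip():
            synthesize(token)
```

Fed the two tokens from the reproduction, this model records both in the transcript but synthesizes only the first, which matches the impact described in this report.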
Expected Behavior
retain_format=True should match the non-retain path and not emit whitespace-only trailing segments, while still preserving the original formatting of real trailing content (e.g. a retained "\n\nMore" must be kept intact).
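One way the expected behavior could look is sketched below as a standalone helper. This is not necessarily what #6295 does: append_tail is a hypothetical name, and the sketch assumes that a whitespace-only tail is simply dropped rather than merged into the preceding segment.

```python
from typing import List, Tuple

# (sentence text, start offset, end offset), as in the snippet above
Segment = Tuple[str, int, int]


def append_tail(
    merged: List[Segment], text: str, start: int, retain_format: bool
) -> None:
    """Append the trailing segment text[start:], skipping blank tails."""
    if start >= len(text):
        return
    raw = text[start:]
    if not raw.strip():
        # Whitespace-only tail: skip it in both modes so they agree.
        return
    # Real trailing content: keep it verbatim in retain_format mode,
    # stripped otherwise (same as the existing non-retain branch).
    sentence = raw if retain_format else raw.strip()
    merged.append((sentence, start, len(text)))
```

With this shape, a tail of "\n\n" adds nothing in either mode, while a tail of "\n\nMore" is kept intact under retain_format=True and reduced to "More" otherwise.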
Package Versions
- livekit-agents (main)
- livekit-blingfire ~=1.1
Additional Context
Fix in #6295.