Add livekit-plugins-funasr (FunASR/SenseVoice local STT)#6176
LauraGPT wants to merge 10 commits into
    resampler = rtc.AudioResampler(
        combined.sample_rate, _SAMPLE_RATE, num_channels=channels
    )
AudioResampler called with num_channels= unlike any other usage in the repo
The rtc.AudioResampler() call on line 123 passes num_channels=channels, but across 25+ other AudioResampler usages in the codebase (base STT class at livekit-agents/livekit/agents/stt/stt.py:480, silero VAD, openai realtime, etc.), none pass num_channels. All other callers use only input_rate/output_rate (or positional equivalents) and occasionally quality. I couldn't verify the actual rtc.AudioResampler constructor signature since the livekit-rtc native package isn't available in this environment. If num_channels is not a valid parameter, this would cause a TypeError at runtime whenever combined.sample_rate != 16000. Worth verifying against the livekit-rtc API docs.
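One way to settle the question is to inspect the constructor's signature at runtime rather than infer it from other call sites. The sketch below uses a hypothetical stand-in class, `FakeResampler`, because the real `rtc.AudioResampler` signature is not confirmed in this thread; `accepts_keyword` is an illustrative helper and not part of livekit.

```python
import inspect


class FakeResampler:
    """Hypothetical stand-in; NOT the real rtc.AudioResampler signature."""

    def __init__(self, input_rate: int, output_rate: int, *, quality: str = "medium") -> None:
        self.input_rate = input_rate
        self.output_rate = output_rate
        self.quality = quality


def accepts_keyword(callable_obj, name: str) -> bool:
    """Return True if callable_obj can be called with keyword argument `name`."""
    params = inspect.signature(callable_obj).parameters
    if name in params:
        # Positional-only parameters cannot be passed by keyword.
        return params[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())
```

With livekit-rtc installed, the same check would presumably be `accepts_keyword(rtc.AudioResampler, "num_channels")`, which answers the question before the resampling branch is ever hit at runtime.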
    def _run() -> str:
        result = self._model.generate(
            input=samples,
            cache={},
            language=lang,
            use_itn=self._opts.use_itn,
        )
        return result[0]["text"] if result else ""

    try:
        raw = await asyncio.to_thread(_run)
Info: asyncio.Lock correctly serializes concurrent inference calls
The asyncio.Lock at line 92 is used to serialize concurrent _recognize_impl calls that dispatch _run() to a thread via asyncio.to_thread. Since the lock is acquired before dispatching and held until the thread completes, only one _run() executes at a time, which correctly protects the non-thread-safe self._model.generate. The lock is created in __init__, which is safe in Python 3.10+ since asyncio.Lock() no longer binds to an event loop at creation time. The _run() closure does read self._opts.use_itn at execution time (rather than capturing it), which means a concurrent update_options() call could change the value between closure creation and execution, but this is a minor TOCTOU that's consistent with patterns across the codebase.
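The serialization pattern described above can be sketched in isolation. This is a minimal illustration, not the plugin's code: `FakeModel` and `Recognizer` are hypothetical stand-ins, and the `use_itn` snapshot shows one way to avoid the TOCTOU noted in the comment by reading the option before waiting on the lock.

```python
import asyncio
import threading
import time


class FakeModel:
    """Hypothetical non-thread-safe model; records overlapping generate() calls."""

    def __init__(self) -> None:
        self._active = 0
        self.max_active = 0
        self._guard = threading.Lock()  # protects the counters only

    def generate(self, text: str) -> str:
        with self._guard:
            self._active += 1
            self.max_active = max(self.max_active, self._active)
        time.sleep(0.01)  # simulate blocking inference
        with self._guard:
            self._active -= 1
        return text.upper()


class Recognizer:
    def __init__(self, model: FakeModel, use_itn: bool = True) -> None:
        self._model = model
        self._lock = asyncio.Lock()
        self.use_itn = use_itn

    async def recognize(self, text: str) -> tuple[str, bool]:
        # Snapshot the option before waiting on the lock, so a later
        # update cannot change what this call runs with.
        use_itn = self.use_itn

        def _run() -> tuple[str, bool]:
            return self._model.generate(text), use_itn

        # The lock is held until the worker thread finishes, so only one
        # generate() call is in flight at a time.
        async with self._lock:
            return await asyncio.to_thread(_run)


async def main() -> tuple[list, int]:
    model = FakeModel()
    rec = Recognizer(model)
    results = await asyncio.gather(*(rec.recognize(w) for w in ["a", "b", "c"]))
    return results, model.max_active
```

Running `asyncio.run(main())` launches three concurrent calls; `max_active` stays at 1 because the lock is held across the `to_thread` await.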
Thanks for the review! Addressed the findings:
CI is green (ruff + type-check). The transcription core (16-bit PCM -> resample to 16k -> SenseVoice -> cleaned text) was verified locally.
    try:
        async with self._lock:
            raw = await asyncio.to_thread(_run)
    except Exception as e:
        raise APIConnectionError("failed to run FunASR inference") from e
All exceptions wrapped as retryable APIConnectionError causes needless retries of deterministic local-inference failures
The blanket except Exception at line 151 wraps every error from FunASR inference (e.g. KeyError from unexpected model output, RuntimeError from CUDA OOM, ValueError from bad input) as APIConnectionError, which defaults to retryable=True. The base class recognize() method (livekit-agents/livekit/agents/stt/stt.py:227) catches APIError (the parent of APIConnectionError) and retries. For a fully-local inference model, errors are virtually never transient; retrying a deterministic failure like an OOM or bad model output wastes time and delays the real error being surfaced to the caller.
Current:

    try:
        async with self._lock:
            raw = await asyncio.to_thread(_run)
    except Exception as e:
        raise APIConnectionError("failed to run FunASR inference") from e

Suggested:

    try:
        async with self._lock:
            raw = await asyncio.to_thread(_run)
    except Exception as e:
        raise APIConnectionError(
            "failed to run FunASR inference", retryable=False
        ) from e
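The effect of the `retryable` flag can be illustrated with a toy retry loop. Everything here is a hypothetical stand-in: the `APIError` and `APIConnectionError` classes only mimic the behavior the review describes, and `call_with_retries` is a simplified sketch, since the real retry logic in stt.py is not shown in this thread and likely differs (backoff, logging).

```python
class APIError(Exception):
    """Hypothetical stand-in for the framework's error base class."""

    def __init__(self, message: str, *, retryable: bool = True) -> None:
        super().__init__(message)
        self.retryable = retryable


class APIConnectionError(APIError):
    """Hypothetical stand-in; retryable by default, as the review describes."""


def call_with_retries(fn, max_retries: int = 3):
    """Call fn, retrying only errors marked retryable. Returns (result, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return fn(), attempts
        except APIError as e:
            # Give up immediately on non-retryable errors, or once retries run out.
            if not e.retryable or attempts > max_retries:
                raise


def failing_inference(retryable: bool):
    """Build an always-failing inference function plus a call counter."""
    calls = {"n": 0}

    def _infer():
        calls["n"] += 1
        try:
            raise RuntimeError("simulated out-of-memory")  # deterministic failure
        except Exception as e:
            raise APIConnectionError(
                "failed to run FunASR inference", retryable=retryable
            ) from e

    return _infer, calls
```

With `retryable=True` the failing function runs once plus `max_retries` more times before the error surfaces; with `retryable=False` it runs once and the error is raised straight away.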
This PR adds livekit-plugins-funasr, a local speech-to-text plugin backed by FunASR / SenseVoice.

Why: SenseVoice is an open-source, fully on-device, non-autoregressive multilingual ASR model (Chinese, Cantonese, English, Japanese, Korean and more) with strong Chinese accuracy and fast inference. That makes it a useful local STT for agents, particularly for Chinese/Cantonese where Whisper is weaker. It runs locally, so no API key is required.

What it does:
- FunASRSTT(stt.STT) (non-streaming SegmentedSTTService-compatible STT). _recognize_impl combines the audio buffer, resamples to 16 kHz mono via rtc.AudioResampler, runs the local model, strips the rich tags with rich_transcription_postprocess, and returns a SpeechEvent (also reporting the auto-detected language).
- model (default iic/SenseVoiceSmall), device, language (auto-detect when unset) and use_itn.
- livekit-plugins-fal); registered in [tool.uv.sources].

Verification (run locally):
- livekit APIs: rtc.combine_audio_frames + rtc.AudioResampler (48 kHz -> 16 kHz and 16 kHz pass-through) and SpeechEvent / SpeechData construction.

Note: uv.lock likely needs regeneration for the new workspace package; happy to follow whatever the maintainers prefer for that and for adding tests.
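The tag-stripping and language-reporting step in the description can be sketched with a regular expression. This assumes SenseVoice prefixes its output with tags of the form `<|en|><|NEUTRAL|><|Speech|><|withitn|>` and that the first tag names the detected language. The plugin itself uses rich_transcription_postprocess, which may do more than this; `split_rich_tags` is a hypothetical helper, not the FunASR implementation.

```python
import re

# Matches one rich tag such as <|en|> and captures the name inside it.
_TAG = re.compile(r"<\|([^|]*)\|>")


def split_rich_tags(raw: str) -> tuple[str, list[str]]:
    """Return (text with tags removed, tag names in order of appearance)."""
    tags = _TAG.findall(raw)
    text = _TAG.sub("", raw).strip()
    return text, tags
```

Under the stated assumption, the detected language would be `tags[0]` when the tag list is non-empty; input without tags passes through unchanged.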