feat(realtime): Semantic VAD EOU token #10444
Open
richiejp wants to merge 16 commits into
…ptResult.eou New bidi RPC for live-microphone ASR: config-first request stream carrying mono float PCM, response stream carrying transcript deltas plus an eou flag for the model's end-of-utterance token, with a ready ack so callers can detect unsupported backends synchronously and a terminal final_result on CloseSend. TranscriptResult gains an eou field for the unary path. Assisted-by: Claude Code:claude-fable-5
…and embed Scaffolding follows the AudioTransformStream pattern: AIModel interface method, base.Base default returning the typed Unimplemented signal (grpcerrors.LiveTranscriptionUnsupported — Unimplemented rather than FailedPrecondition, which IsModelNotLoaded claims), server recv/send pump, duplex client wrapper, and in-process embed facades. Two deliberate deviations from the template, both serving the ready-ack contract (callers block on the first Recv to detect unsupported backends synchronously): the server pump returns the backend error without waiting for the client to close its send side, and the embed goroutine stashes the terminal error before closing the response channel so it cannot be lost. Assisted-by: Claude Code:claude-fable-5
AudioTranscriptionLive drives one cache-aware streaming session per RPC
over the already-bound stream_begin/feed/finalize C API: config-first,
ready ack (or the typed Unimplemented signal for non-streaming models),
{delta, eou, words} per feed, decoder-auto-reset across utterances, and a
terminal final_result on send-close.
The engine mutex is now taken per C call instead of for a whole stream's
lifetime — safe because every parakeet.cpp call builds its own graph and
all streaming caches live in the session object; the only ctx-shared state
is last_error, which is read under the same lock as the failing call. A
long-lived live session therefore no longer starves batched unary
transcription (and vice versa: the file-based streaming path stops
blocking everything for the duration of a clip). Free() now also takes the
lock so a live feed can never race a model unload into a freed ctx.
The offline decode of the realtime EOU model keeps the literal <EOU>/<EOB>
token in its text; strip it from user-visible transcripts and surface it
as TranscriptResult.eou instead — the signal the realtime retranscribe
gate consumes.
Assisted-by: Claude Code:claude-fable-5
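The per-call locking described in this commit can be illustrated with a minimal sketch. The `engine` type, `call`, `free`, and `errEngineFreed` are hypothetical stand-ins, not the backend's real names; the only point shown is that the mutex is held for one call at a time and that freeing takes the same lock, so a late feed sees a clean error instead of a freed context.

```go
package main

import (
	"errors"
	"sync"
)

// errEngineFreed is a hypothetical sentinel for calls made after unload.
var errEngineFreed = errors.New("engine freed")

// engine is a stand-in for the wrapper around the C context.
type engine struct {
	mu    sync.Mutex
	freed bool
}

// call runs one C-level operation under the lock. The lock is released as
// soon as the call returns, so a long-lived stream only blocks other work
// for the duration of a single feed, not for its whole lifetime.
func (e *engine) call(fn func() error) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.freed {
		return errEngineFreed
	}
	return fn()
}

// free takes the same lock, so it cannot run in the middle of a call and
// no call can start on a freed context afterwards.
func (e *engine) free() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.freed = true
}
```

A live session would wrap each begin/feed/finalize in `call`, and an unload would go through `free`.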
…ction config
backend.ModelTranscriptionLive opens the bidirectional RPC, sends the
session config, and blocks on the backend's ready ack so unsupported
backends (grpcerrors.IsLiveTranscriptionUnsupported) are detected
synchronously and callers can degrade; events then flow from a background
recv goroutine and Close drains the terminal Final event.
pipeline.turn_detection {type, eagerness, retranscribe} sets the
server-side default turn-detection mode for realtime sessions so clients
need no session.update to benefit; schema.TranscriptionResult gains the
eou flag carried over from the proto.
Assisted-by: Claude Code:claude-fable-5
…ibe gate
Implement turn_detection.type=semantic_vad (previously parsed but dead: toggleVAD never started the loop for it). The transcription model is fed the microphone audio live while the user speaks, and its end-of-utterance token drives a dynamic silence window: 300ms once the token fires, the eagerness fallback (low 8s / medium 4s / high 2s) when it does not. Pausing to think therefore no longer gets cut off, while finished sentences get a fast response.
Silero VAD stays in charge of speech_started/barge-in and the silence measurement, so a spurious EOU cannot interrupt ongoing speech, and an EOU is cancelled when the VAD sees speech resume after it.
One live stream per turn (begun at first speech, finalized at commit) keeps the C session bounded; the finalize flush completes the trailing words, so the streamed transcript is reused at commit: delta events are replayed and the model runs once per utterance.
With pipeline.turn_detection.retranscribe, EOU-triggered commits are instead cross-checked by an offline decode: no trailing <EOU> there means keep listening, otherwise the batch transcript is used and the streamed-vs-batch pair is logged as an alignment diagnostic.
Backends without the live RPC degrade once, loudly, to silence-only detection with the eagerness window.
The manual input_audio_buffer.commit guard now also covers semantic sessions (it previously cleared the buffer under the VAD loop's feet).
Coverage baseline ratcheted 45.0 -> 48.5.
Assisted-by: Claude Code:claude-fable-5
Two stacked failures kept transcription clips on the Traces page unplayable:
- the waveform peaks renderer fetch()es the player src, and the CSP's connect-src allows blob: but not data:, so the data:audio/wav URL the page built was blocked outright. Decode the base64 snippet into a Blob and hand the player an object URL instead.
- raising tracing_max_body_bytes at runtime (Settings -> Tracing) only reached the producers: RecordBackendTrace capped Data fields with the value frozen by the first InitBackendTracingIfEnabled call, so trace.AudioSnippet (which reads the live value) embedded a clip the recorder then stomped into '<truncated: N bytes>', which the page dutifully glued into the data URL. The cap now follows the latest call (every recording path passes the current appConfig value); the ring-buffer size keeps first-call semantics.
The page also degrades readably now: a payload that is not decodable base64 (traces recorded truncated by an older server) renders a note instead of a broken player.
Assisted-by: Claude Code:claude-fable-5
…e window The end-of-utterance token already trails the audio by the encoder chunk schedule plus a VAD tick, and the commit check only runs once silero has closed the speech segment, which itself requires real silence. Stacking another 0.3s window on top of those was pure added response delay, noticeable in live conversation. Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
The no-speech clear nil'ed the whole input buffer, but 'no segments' is not 'no speech': silero crosses its 0.5 threshold up to a few hundred ms into a soft word onset, so the newest audio in the inspected window was often the start of a word the next tick would have recognized — and audio appended while the tick ran was wiped with it. Up to a full word vanished from the front of utterances, depending on where the onset fell relative to the clear. Keep a 0.5s holdback of the inspected window plus all mid-tick appends when clearing on no-speech (the long-standing TODO), and preserve mid-tick appends at commit time in server_vad mode too — semantic_vad already did. Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
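The holdback described here can be sketched as a pure function over the input buffer. The function name and its parameters are hypothetical; it assumes the inspected window is a prefix of the buffer and that anything beyond it was appended while the tick ran.

```go
package main

// holdbackOnNoSpeech returns what to keep after a VAD tick that found no
// speech segments. buf is the whole input buffer at clear time, inspected is
// the number of leading samples the tick actually looked at, and sampleRate
// is in samples per second. The last 0.5s of the inspected window is kept,
// together with every sample appended mid-tick.
func holdbackOnNoSpeech(buf []float32, inspected, sampleRate int) []float32 {
	if inspected > len(buf) {
		inspected = len(buf)
	}
	holdback := sampleRate / 2 // 0.5s worth of samples
	start := inspected - holdback
	if start < 0 {
		start = 0
	}
	// Copy so the caller can release the old backing array.
	kept := make([]float32, len(buf)-start)
	copy(kept, buf[start:])
	return kept
}
```

With a 4 Hz toy sample rate, a 10-sample buffer of which 8 were inspected keeps samples 6 through 9: two held back plus two appended mid-tick.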
model_load traces were failure-only, so the Traces page couldn't answer 'which backend build served this model' — exactly the question that would have exposed a stale installed backend (the realtime semantic_vad session was silently degrading because the installed parakeet-cpp predated the AudioTranscriptionLive RPC). Add a load observer hook to the ModelLoader, fired once per actual load attempt (cache hits and coalesced loads never reach it; distributed-mode routing is excluded since the worker owns the real load), carrying the alias-resolved backend and the resolved runtime URI — the installed backend's launcher path, which names the variant directory. The core wires it to a 'Model loaded' backend trace; failures stay with the modality wrappers' recordModelLoadFailure to avoid duplicate rows. The code move also brings the relocated block up to lint: clone the shared gRPC options with proto.Clone instead of a mutex-copying struct copy, and log (rather than drop) Stop errors during failed-load cleanup. Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
…telemetry A semantic_vad session with retranscribe off never touches the unary transcription wrapper, so streaming-only pipelines produced no transcription traces at all. The live session now records one trace per turn at Close: final text, eou flag, event counts, and an audio snippet built from the fed PCM (bounded by the existing 30s snippet cap), with source=live_stream to tell it apart from batch decodes. To answer 'is the EOU token slow or is it the loop', every semantic_vad commit now logs its decomposed timing — trigger (eou|timeout), the VAD's speech end, the token's lag behind it, and the silence accumulated at commit. And the loop drains live events a second time after runVAD: an EOU produced by this tick's feed was previously left for the next tick, adding 300ms to every EOU-triggered commit. Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
…anups The three dialing bidi stream wrappers (AudioTranscriptionLive, AudioTransformStream, AudioToAudioStream) closed their gRPC connection inside CloseSend. All three protocols deliver responses AFTER the client closes its send side — the live-ASR FinalResult that flushes the decode tail, the transform/audio response tails — so every session died with 'grpc: the client connection is closing' on the pending Recv and lost its terminal payload. The realtime semantic_vad path hit this on every turn (live transcription finalize failed), dropping the final transcript and its tail words. Adopt forwardClient's proven lifecycle: CloseSend only closes the send side; the conn (and watchdog busy-mark) is released exactly once when Recv reaches a terminal state. The existing live-ASR test runs on the embedded test:// path where no ClientConn exists, which is why it never caught this — the new regression test drives the same contract through a real TCP dial and fails with the exact production error against the old code. Also a cleanup pass over the branch: pkg/sound gains Float32sToInt16LEBytes (the clamping conversion the live trace accumulator open-coded; addPCM now bulk-appends one converted chunk per feed), the live session config uses the liveSampleRate constant, and grpcModel resolves the external-backend URI once instead of rebuilding the merged backend map twice per load. Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
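For readers unfamiliar with the clamping conversion mentioned in the cleanup pass, here is a minimal sketch of a float-to-PCM16 helper. It is not the pkg/sound implementation: the lowercase name, the 32767 scale factor, and the NaN handling are assumptions made for the example.

```go
package main

import "encoding/binary"

// float32sToInt16LEBytes converts mono float PCM in [-1, 1] to 16-bit
// little-endian bytes, clamping out-of-range samples instead of letting
// them wrap around.
func float32sToInt16LEBytes(samples []float32) []byte {
	out := make([]byte, 2*len(samples))
	for i, s := range samples {
		switch {
		case s != s: // NaN: treat as silence (assumption)
			s = 0
		case s > 1:
			s = 1
		case s < -1:
			s = -1
		}
		v := int16(s * 32767)
		binary.LittleEndian.PutUint16(out[2*i:], uint16(v))
	}
	return out
}
```

Converting one chunk per feed and appending the result, as the commit describes for addPCM, avoids a per-sample append.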
… EOU/EOB split The pinned e270af73 rebuilt the full mel spectrogram on every stream_feed, quadratic in turn length: live decodes fell behind real time as a turn grew, so the <EOU> the semantic_vad turn detector waits on arrived seconds late (2.7s at 6s of audio in production logs) or after the eagerness timeout entirely. 4cd0e4e replaces it with an incremental StreamingMel and splits <EOU> from <EOB> across the C boundary (eou_out bitmask, separate JSON eou/eob flags). Adapt the bindings: the stream JSON gains the eob field, the text path reads the v5 bitmask (bit0 <EOU>, bit1 <EOB>), and stripEouMarker no longer reports a backchannel-ended offline decode as an utterance end — so the retranscribe gate cannot be confirmed by an 'uh-huh'. The live RPC forwards eob (new proto field) and the realtime turn detector logs backchannels without arming the commit threshold. Live feeds also log per-feed decode latency and warn once when the decode falls more than a second behind real time — the failure mode v4 made chronic. Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
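The EOU/EOB split can be sketched in two small helpers: one decoding the bitmask (bit0 <EOU>, bit1 <EOB>, as stated above) and one separating a trailing marker from offline-decode text. All names here are hypothetical and the marker handling is a simplified assumption; it only looks at a single trailing token.

```go
package main

import "strings"

const (
	eouBit uint32 = 1 << 0 // <EOU>: the utterance ended
	eobBit uint32 = 1 << 1 // <EOB>: a backchannel ended
)

// endFlags keeps the two signals separate so a backchannel is never
// mistaken for an utterance end.
type endFlags struct{ EOU, EOB bool }

// decodeEndFlags unpacks the bitmask returned across the C boundary.
func decodeEndFlags(mask uint32) endFlags {
	return endFlags{EOU: mask&eouBit != 0, EOB: mask&eobBit != 0}
}

// splitEndMarker strips one trailing <EOU> or <EOB> token from an offline
// decode and reports which one it was.
func splitEndMarker(text string) (string, endFlags) {
	t := strings.TrimSpace(text)
	switch {
	case strings.HasSuffix(t, "<EOU>"):
		return strings.TrimSpace(strings.TrimSuffix(t, "<EOU>")), endFlags{EOU: true}
	case strings.HasSuffix(t, "<EOB>"):
		return strings.TrimSpace(strings.TrimSuffix(t, "<EOB>")), endFlags{EOB: true}
	}
	return t, endFlags{}
}
```

A retranscribe gate built on this would confirm a commit only when `EOU` is set, so a decode ending in `<EOB>` keeps the session listening.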
…le the user speaks semantic_vad accumulated the live stream's deltas server-side and only replayed them at commit, so the browser showed nothing until the turn ended — defeating the point of a live decode. The turn's conversation item id is now allocated when the turn OPENS; caption deltas stream to the client under it as they drain, the committed event and the completed transcript reuse it (so the authoritative text — including a retranscribe batch correction — replaces the partial instead of duplicating it), and a turn discarded before commit emits the transcription failed event so clients retract its captions. Talk.jsx learns the other half: input transcription delta events upsert a user entry keyed by item_id (the same idempotent upsert the assistant side already used), completed replaces it (or appends unkeyed for server_vad), and failed removes it. Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
semantic_vad recorded the end-of-utterance token in a boolean, eouSeen, that observeSegments then cleared on any tick where silero still showed the speech segment open (End == 0). But the parakeet <EOU> is predictive — it fires as the user stops, before silero finishes its silence padding and closes the segment — so eouSeen was cleared on the very tick it was set, the commit fell through to the eagerness timeout, and turns ended 2-4s late (trigger=timeout in the logs) even though the token had been observed. Replace the toggled flag with a recorded fact: eouAtSec, the audio position of the most recent EOU, set when it drains and never flipped off mid-turn. Whether it still governs the trailing silence is now a pure function, eouPending(segments), computed once per tick in handleVAD and threaded into the threshold, the commit-trigger log, and the retranscribe gate. An EOU stops applying only when the user STARTS a new utterance after it (a segment whose start is past the EOU) — genuine resumed speech — not when silero is merely slow to close the current one. The retranscribe-reject path consumes the recorded EOU (eouAtSec = 0) as a deliberate transition rather than a toggle-back. Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Bash]
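The recorded-fact approach can be sketched as a pure function. The `segment` type and the explicit `eouAtSec` parameter are assumptions for the example (the commit names `eouPending(segments)` and `eouAtSec` but does not show their shapes); positions are seconds of audio, and `End == 0` marks a segment silero has not closed yet.

```go
package main

// segment is a hypothetical VAD speech segment in seconds of audio.
type segment struct {
	Start float64
	End   float64 // 0 while the segment is still open
}

// eouPending reports whether the most recent EOU still governs the trailing
// silence. It stops applying only when some segment starts after the EOU,
// which means the user genuinely began speaking again. An open segment that
// started before the EOU does not cancel it, because silero may simply be
// slow to close it.
func eouPending(eouAtSec float64, segments []segment) bool {
	if eouAtSec <= 0 {
		return false // no EOU recorded this turn, or it was consumed
	}
	for _, s := range segments {
		if s.Start > eouAtSec {
			return false
		}
	}
	return true
}
```

Because the result depends only on its inputs, it can be computed once per tick and passed to the threshold, the commit-trigger log, and the retranscribe gate without any of them mutating shared state.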
With clause chunking, each completed clause was synthesized inline in the LLM token callback. That callback runs on the goroutine draining the model's gRPC stream, and emitSpeech blocks until the whole clause is synthesized (for WebRTC, until it has been played back at real time), so every clause boundary stalled token generation AND froze the assistant transcript the client sees — the LLM and TTS ran strictly serially despite streaming being on. Move synthesis to a single worker goroutine (ttsPipeline). The token callback now just sends the transcript delta and hands each clause to the worker via a non-blocking enqueue, so the recv loop keeps draining tokens and the transcript keeps streaming while audio is produced behind it. The clause queue is unbounded — clauses are short strings, a reply has a bounded number of them, and the costly product (audio) is paced by the backend regardless — so enqueue never applies backpressure to generation. One worker preserves clause and audio ordering; both transports are already safe for the now-concurrent transcript and audio sends (WS guards SendEvent with a mutex, WebRTC sends events to a channel and guards RTP writes with rtpMu). wait() drains and joins the worker, returning the accumulated audio and the first synthesis error; it is idempotent, so streamLLMResponse drains it explicitly to read the audio and also defers a wait() as a leak-proof backstop against future early returns. Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Write] [Bash]
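The worker arrangement can be sketched as follows. This is a simplified model, not the branch's ttsPipeline: the `synth` callback stands in for emitSpeech, and continuing to synthesize after the first error is an assumption of the sketch.

```go
package main

import "sync"

// ttsPipeline runs synthesis on one worker goroutine fed by an unbounded
// queue, so the token callback never blocks on audio.
type ttsPipeline struct {
	mu     sync.Mutex
	cond   *sync.Cond
	queue  []string
	closed bool
	done   chan struct{}
	audio  []byte
	err    error
}

func newTTSPipeline(synth func(string) ([]byte, error)) *ttsPipeline {
	p := &ttsPipeline{done: make(chan struct{})}
	p.cond = sync.NewCond(&p.mu)
	go p.run(synth)
	return p
}

// enqueue appends a clause and returns immediately (no backpressure).
func (p *ttsPipeline) enqueue(clause string) {
	p.mu.Lock()
	if !p.closed {
		p.queue = append(p.queue, clause)
	}
	p.mu.Unlock()
	p.cond.Signal()
}

// run is the single worker: one goroutine preserves clause order.
func (p *ttsPipeline) run(synth func(string) ([]byte, error)) {
	defer close(p.done)
	for {
		p.mu.Lock()
		for len(p.queue) == 0 && !p.closed {
			p.cond.Wait()
		}
		if len(p.queue) == 0 {
			p.mu.Unlock()
			return // closed and fully drained
		}
		clause := p.queue[0]
		p.queue = p.queue[1:]
		p.mu.Unlock()

		chunk, err := synth(clause) // slow work happens outside the lock
		p.mu.Lock()
		p.audio = append(p.audio, chunk...)
		if err != nil && p.err == nil {
			p.err = err // keep only the first synthesis error
		}
		p.mu.Unlock()
	}
}

// wait drains the queue, joins the worker, and returns the accumulated
// audio and the first error. Calling it again returns the same result.
func (p *ttsPipeline) wait() ([]byte, error) {
	p.mu.Lock()
	p.closed = true
	p.mu.Unlock()
	p.cond.Signal()
	<-p.done
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.audio, p.err
}
```

Because `wait` is idempotent, a caller can invoke it explicitly to read the audio and also `defer` it as a backstop, which matches the usage the commit describes for streamLLMResponse.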
The Talk page rendered the pipeline's VAD/Transcription/LLM/TTS models in a four-column grid. Grid tracks default to min-width:auto, so a long model name with white-space:nowrap forced its track wider than its 1fr share and overflowed the container (the inner overflow:hidden could not shrink the track). Lay them out as a vertical key-value list instead: each value gets the full row width and wraps (minWidth:0 + overflowWrap:anywhere) rather than overflowing, with a dash fallback for an unset component. Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Bash]
Description
Use the EOU token from Parakeet transcription to implement semantic VAD.
Notes for Reviewers
Signed commits