feat(realtime): Semantic VAD EOU token#10444

Open
richiejp wants to merge 16 commits into
mudler:master from
richiejp:feat/realtime-semantic-vad-eou

@richiejp

Collaborator

Description

Use the EOU token from Parakeet transcription to implement semantic VAD.

Notes for Reviewers

  • feat(grpc): add AudioTranscriptionLive bidirectional RPC and TranscriptResult.eou
  • feat(grpc): wire AudioTranscriptionLive through client, server, base and embed
  • feat(parakeet-cpp): live transcription RPC with per-call engine locking
  • feat(core): live transcription session wrapper and pipeline turn_detection config
  • feat(realtime): EOU-driven semantic_vad turn detection with retranscribe gate
  • fix(traces): make audio clips playable — blob URLs and a live body cap
  • feat(realtime): commit immediately on EOU, drop the extra 0.3s silence window
  • fix(realtime): stop cutting the start of utterances on VAD buffer clears
  • feat(trace): record a model_load trace for every successful backend load
  • feat(realtime): per-turn live transcription traces and commit timing telemetry
  • fix(grpc): release bidi stream conns on terminal Recv; PCM/lookup cleanups
  • feat(parakeet-cpp): bump to parakeet.cpp ABI v5 — incremental mel and EOU/EOB split
  • feat(realtime): live input captions — stream transcription deltas while the user speaks
  • fix(realtime): stop dropping the EOU token to the eagerness timeout
  • feat(realtime): synthesize clauses off the LLM token callback
  • fix(ui): show the realtime pipeline components as a vertical list

Signed commits

  • Yes, I signed my commits.

richiejp added 16 commits June 22, 2026 13:06
…ptResult.eou

New bidi RPC for live-microphone ASR: config-first request stream carrying
mono float PCM, response stream carrying transcript deltas plus an eou flag
for the model's end-of-utterance token, with a ready ack so callers can
detect unsupported backends synchronously and a terminal final_result on
CloseSend. TranscriptResult gains an eou field for the unary path.
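
The message ordering described above can be modelled without gRPC. The following is a minimal sketch, not the PR's code: the `Request`, `Config`, `Response` types and the `serveLive` function are hypothetical stand-ins, a closed channel plays the role of CloseSend, and `decode` stands in for the backend's streaming decoder.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the proto messages; field names are assumptions.
type Config struct{ SampleRate int }

type Request struct {
	Config *Config   // set on the first message only
	PCM    []float32 // mono float PCM on later messages
}

type Response struct {
	Ready bool   // ack sent once the config is accepted
	Delta string // transcript delta for one feed
	EOU   bool   // end-of-utterance token seen in this feed
	Final bool   // terminal result after the client closes its send side
	Text  string // full text, only set on the final result
}

var errConfigFirst = errors.New("first message must carry the session config")

// serveLive reads requests until the channel is closed (our stand-in for
// CloseSend) and returns the responses in the order they would be sent.
func serveLive(reqs <-chan Request, decode func([]float32) (string, bool)) ([]Response, error) {
	first, ok := <-reqs
	if !ok || first.Config == nil {
		return nil, errConfigFirst
	}
	out := []Response{{Ready: true}}
	text := ""
	for r := range reqs {
		delta, eou := decode(r.PCM)
		text += delta
		out = append(out, Response{Delta: delta, EOU: eou})
	}
	return append(out, Response{Final: true, Text: text}), nil
}

func main() {
	reqs := make(chan Request, 3)
	reqs <- Request{Config: &Config{SampleRate: 16000}}
	reqs <- Request{PCM: make([]float32, 160)}
	reqs <- Request{PCM: make([]float32, 160)}
	close(reqs)
	out, err := serveLive(reqs, func([]float32) (string, bool) { return "word ", false })
	fmt.Println(len(out), err)
}
```

Because the ready ack is always the first response, a caller can block on a single receive to learn whether the backend supports the live path at all.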

Assisted-by: Claude Code:claude-fable-5
…and embed

Scaffolding follows the AudioTransformStream pattern: AIModel interface
method, base.Base default returning the typed Unimplemented signal
(grpcerrors.LiveTranscriptionUnsupported — Unimplemented rather than
FailedPrecondition, which IsModelNotLoaded claims), server recv/send pump,
duplex client wrapper, and in-process embed facades.

Two deliberate deviations from the template, both serving the ready-ack
contract (callers block on the first Recv to detect unsupported backends
synchronously): the server pump returns the backend error without waiting
for the client to close its send side, and the embed goroutine stashes the
terminal error before closing the response channel so it cannot be lost.

Assisted-by: Claude Code:claude-fable-5
AudioTranscriptionLive drives one cache-aware streaming session per RPC
over the already-bound stream_begin/feed/finalize C API: config-first,
ready ack (or the typed Unimplemented signal for non-streaming models),
{delta, eou, words} per feed, decoder-auto-reset across utterances, and a
terminal final_result on send-close.

The engine mutex is now taken per C call instead of for a whole stream's
lifetime — safe because every parakeet.cpp call builds its own graph and
all streaming caches live in the session object; the only ctx-shared state
is last_error, which is read under the same lock as the failing call. A
long-lived live session therefore no longer starves batched unary
transcription (and vice versa: the file-based streaming path stops
blocking everything for the duration of a clip). Free() now also takes the
lock so a live feed can never race a model unload into a freed ctx.
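
The locking change above can be illustrated with a small sketch. It assumes nothing about the real binding beyond what the message states: `engine`, `session`, `call` and `feed` are hypothetical names, and a counter stands in for the C session's caches.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errFreed = errors.New("engine already freed")

// engine models the shared context; the mutex guards each individual call
// rather than being held for the lifetime of a stream.
type engine struct {
	mu    sync.Mutex
	freed bool
}

// call runs fn under the engine lock and refuses to touch a freed context.
func (e *engine) call(fn func()) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.freed {
		return errFreed
	}
	fn()
	return nil
}

// Free takes the same lock, so it cannot race an in-flight feed.
func (e *engine) Free() {
	e.mu.Lock()
	e.freed = true
	e.mu.Unlock()
}

// session owns all per-stream state, which is what makes per-call locking safe.
type session struct {
	eng     *engine
	samples int
}

func (s *session) feed(pcm []float32) error {
	return s.eng.call(func() { s.samples += len(pcm) })
}

func main() {
	eng := &engine{}
	live, batch := &session{eng: eng}, &session{eng: eng}
	var wg sync.WaitGroup
	for _, s := range []*session{live, batch} {
		wg.Add(1)
		go func(s *session) {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				_ = s.feed(make([]float32, 160)) // the two sessions interleave
			}
		}(s)
	}
	wg.Wait()
	eng.Free()
	fmt.Println(live.samples, batch.samples, live.feed(nil))
}
```

Holding the lock per call lets the two sessions interleave; holding it per stream would make the second session wait for the first to finish entirely.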

The offline decode of the realtime EOU model keeps the literal <EOU>/<EOB>
token in its text; strip it from user-visible transcripts and surface it
as TranscriptResult.eou instead — the signal the realtime retranscribe
gate consumes.

Assisted-by: Claude Code:claude-fable-5
…ction config

backend.ModelTranscriptionLive opens the bidirectional RPC, sends the
session config, and blocks on the backend's ready ack so unsupported
backends (grpcerrors.IsLiveTranscriptionUnsupported) are detected
synchronously and callers can degrade; events then flow from a background
recv goroutine and Close drains the terminal Final event.

pipeline.turn_detection {type, eagerness, retranscribe} sets the
server-side default turn-detection mode for realtime sessions so clients
need no session.update to benefit; schema.TranscriptionResult gains the
eou flag carried over from the proto.

Assisted-by: Claude Code:claude-fable-5
…ibe gate

Implement turn_detection.type=semantic_vad (previously parsed but dead —
toggleVAD never started the loop for it): the transcription model is fed
the microphone audio live while the user speaks and its end-of-utterance
token drives a dynamic silence window — 300ms once the token fires, the
eagerness fallback (low 8s / medium 4s / high 2s) when it does not — so
pausing to think no longer gets cut off while finished sentences get a
fast response. Silero VAD stays in charge of speech_started/barge-in and
the silence measurement, so a spurious EOU cannot interrupt ongoing
speech, and an EOU is cancelled when the VAD sees speech resume after it.
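
As a rough sketch of the window selection described in this commit: the durations come from the message above, while the function name `silenceWindow` and the choice to treat any unrecognized eagerness value as medium are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// silenceWindow returns how much trailing silence is required before a
// commit. eouPending means the end-of-utterance token has fired and has not
// been cancelled by resumed speech.
func silenceWindow(eagerness string, eouPending bool) time.Duration {
	if eouPending {
		return 300 * time.Millisecond
	}
	switch eagerness {
	case "low":
		return 8 * time.Second
	case "high":
		return 2 * time.Second
	default: // "medium"; other values falling back here is an assumption
		return 4 * time.Second
	}
}

func main() {
	fmt.Println(silenceWindow("low", false), silenceWindow("low", true))
}
```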

One live stream per turn (begun at first speech, finalized at commit)
keeps the C session bounded; the finalize flush completes the trailing
words, so the streamed transcript is reused at commit — delta events are
replayed and the model runs once per utterance. With
pipeline.turn_detection.retranscribe, EOU-triggered commits are instead
cross-checked by an offline decode: no trailing <EOU> there means keep
listening, otherwise the batch transcript is used and the streamed-vs-
batch pair is logged as an alignment diagnostic.

Backends without the live RPC degrade once, loudly, to silence-only
detection with the eagerness window. The manual input_audio_buffer.commit
guard now also covers semantic sessions (it previously cleared the buffer
under the VAD loop's feet).

Coverage baseline ratcheted 45.0 -> 48.5.

Assisted-by: Claude Code:claude-fable-5
Two stacked failures kept transcription clips on the Traces page
unplayable:

- the waveform peaks renderer fetch()es the player src, and the CSP's
  connect-src allows blob: but not data:, so the data:audio/wav URL the
  page built was blocked outright. Decode the base64 snippet into a Blob
  and hand the player an object URL instead.
- raising tracing_max_body_bytes at runtime (Settings -> Tracing) only
  reached the producers: RecordBackendTrace capped Data fields with the
  value frozen by the first InitBackendTracingIfEnabled call, so
  trace.AudioSnippet (which reads the live value) embedded a clip the
  recorder then stomped into '<truncated: N bytes>' — which the page
  dutifully glued into the data URL. The cap now follows the latest call
  (every recording path passes the current appConfig value); the
  ring-buffer size keeps first-call semantics.

The page also degrades readably now: a payload that is not decodable
base64 (traces recorded truncated by an older server) renders a note
instead of a broken player.

Assisted-by: Claude Code:claude-fable-5
…e window

The end-of-utterance token already trails the audio by the encoder
chunk schedule plus a VAD tick, and the commit check only runs once
silero has closed the speech segment, which itself requires real
silence. Stacking another 0.3s window on top of those was pure added
response delay, noticeable in live conversation.

Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
The no-speech clear nil'ed the whole input buffer, but 'no segments'
is not 'no speech': silero crosses its 0.5 threshold up to a few
hundred ms into a soft word onset, so the newest audio in the
inspected window was often the start of a word the next tick would
have recognized — and audio appended while the tick ran was wiped
with it. Up to a full word vanished from the front of utterances,
depending on where the onset fell relative to the clear.

Keep a 0.5s holdback of the inspected window plus all mid-tick
appends when clearing on no-speech (the long-standing TODO), and
preserve mid-tick appends at commit time in server_vad mode too —
semantic_vad already did.
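
A minimal sketch of the holdback, under stated assumptions: the buffer is a flat slice of samples, `inspected` is the number of samples that were present when the tick started, and `clearWithHoldback` is a hypothetical name rather than the function in the PR.

```go
package main

import (
	"fmt"
	"time"
)

// clearWithHoldback drops the inspected audio except for its last `holdback`
// worth of samples, and keeps everything appended while the tick ran
// (buf[inspected:]).
func clearWithHoldback(buf []float32, inspected, sampleRate int, holdback time.Duration) []float32 {
	if inspected > len(buf) {
		inspected = len(buf)
	}
	keep := int(holdback.Seconds() * float64(sampleRate))
	start := inspected - keep
	if start < 0 {
		start = 0
	}
	// Copy so the retained tail does not pin the old backing array.
	return append([]float32(nil), buf[start:]...)
}

func main() {
	buf := make([]float32, 16000+800) // 1s inspected + 50ms appended mid-tick at 16 kHz
	kept := clearWithHoldback(buf, 16000, 16000, 500*time.Millisecond)
	fmt.Println(len(kept))
}
```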

Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
model_load traces were failure-only, so the Traces page couldn't answer
'which backend build served this model' — exactly the question that
would have exposed a stale installed backend (the realtime semantic_vad
session was silently degrading because the installed parakeet-cpp
predated the AudioTranscriptionLive RPC).

Add a load observer hook to the ModelLoader, fired once per actual load
attempt (cache hits and coalesced loads never reach it; distributed-mode
routing is excluded since the worker owns the real load), carrying the
alias-resolved backend and the resolved runtime URI — the installed
backend's launcher path, which names the variant directory. The core
wires it to a 'Model loaded' backend trace; failures stay with the
modality wrappers' recordModelLoadFailure to avoid duplicate rows.

The code move also brings the relocated block up to lint: clone the
shared gRPC options with proto.Clone instead of a mutex-copying struct
copy, and log (rather than drop) Stop errors during failed-load cleanup.

Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
…telemetry

A semantic_vad session with retranscribe off never touches the unary
transcription wrapper, so streaming-only pipelines produced no
transcription traces at all. The live session now records one trace per
turn at Close: final text, eou flag, event counts, and an audio snippet
built from the fed PCM (bounded by the existing 30s snippet cap), with
source=live_stream to tell it apart from batch decodes.

To answer 'is the EOU token slow or is it the loop', every semantic_vad
commit now logs its decomposed timing — trigger (eou|timeout), the
VAD's speech end, the token's lag behind it, and the silence accumulated
at commit. And the loop drains live events a second time after runVAD:
an EOU produced by this tick's feed was previously left for the next
tick, adding 300ms to every EOU-triggered commit.

Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
…anups

The three dialing bidi stream wrappers (AudioTranscriptionLive,
AudioTransformStream, AudioToAudioStream) closed their gRPC connection
inside CloseSend. All three protocols deliver responses AFTER the
client closes its send side — the live-ASR FinalResult that flushes the
decode tail, the transform/audio response tails — so every session died
with 'grpc: the client connection is closing' on the pending Recv and
lost its terminal payload. The realtime semantic_vad path hit this on
every turn (live transcription finalize failed), dropping the final
transcript and its tail words.

Adopt forwardClient's proven lifecycle: CloseSend only closes the send
side; the conn (and watchdog busy-mark) is released exactly once when
Recv reaches a terminal state. The existing live-ASR test runs on the
embedded test:// path where no ClientConn exists, which is why it never
caught this — the new regression test drives the same contract through
a real TCP dial and fails with the exact production error against the
old code.
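
The lifecycle can be sketched without gRPC. Everything here is illustrative: `stream`, `liveClient` and `fakeStream` are hypothetical types, and the fake only models a final payload followed by `io.EOF` after the send side is closed.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// stream is a hypothetical stand-in for the generated bidi stream.
type stream interface {
	CloseSend() error
	Recv() (string, error)
}

// liveClient releases its connection exactly once, when Recv turns terminal.
type liveClient struct {
	s       stream
	release func()
	once    sync.Once
}

// CloseSend only half-closes: responses may still arrive afterwards.
func (c *liveClient) CloseSend() error { return c.s.CloseSend() }

// Recv releases the connection on the first terminal error (EOF or failure).
func (c *liveClient) Recv() (string, error) {
	msg, err := c.s.Recv()
	if err != nil {
		c.once.Do(c.release)
	}
	return msg, err
}

// fakeStream delivers one final payload after CloseSend, then io.EOF.
type fakeStream struct{ closed, sent bool }

func (f *fakeStream) CloseSend() error { f.closed = true; return nil }

func (f *fakeStream) Recv() (string, error) {
	if f.closed && !f.sent {
		f.sent = true
		return "final", nil
	}
	return "", io.EOF
}

func main() {
	released := 0
	c := &liveClient{s: &fakeStream{}, release: func() { released++ }}
	_ = c.CloseSend()
	msg, _ := c.Recv()
	_, err := c.Recv()
	fmt.Println(msg, err, released)
}
```

Using `sync.Once` keeps the release idempotent even if a caller keeps calling Recv after the terminal error.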

Also a cleanup pass over the branch: pkg/sound gains
Float32sToInt16LEBytes (the clamping conversion the live trace
accumulator open-coded; addPCM now bulk-appends one converted chunk per
feed), the live session config uses the liveSampleRate constant, and
grpcModel resolves the external-backend URI once instead of rebuilding
the merged backend map twice per load.
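
For readers unfamiliar with the clamping conversion mentioned above, here is a sketch of the general technique. It is not the pkg/sound implementation: the lowercase name `float32sToInt16LE` and the 32767 scale factor are assumptions.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// float32sToInt16LE converts float samples in [-1, 1] to 16-bit little-endian
// PCM, clamping out-of-range values instead of letting them wrap around.
func float32sToInt16LE(samples []float32) []byte {
	out := make([]byte, 2*len(samples))
	for i, s := range samples {
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		binary.LittleEndian.PutUint16(out[2*i:], uint16(int16(s*32767)))
	}
	return out
}

func main() {
	fmt.Printf("% x\n", float32sToInt16LE([]float32{0, 1, -1, 2}))
}
```

Converting a whole chunk into one preallocated slice is what allows a single bulk append per feed.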

Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
… EOU/EOB split

The pinned e270af73 rebuilt the full mel spectrogram on every
stream_feed, quadratic in turn length: live decodes fell behind real
time as a turn grew, so the <EOU> the semantic_vad turn detector waits
on arrived seconds late (2.7s at 6s of audio in production logs) or
after the eagerness timeout entirely. 4cd0e4e replaces it with an
incremental StreamingMel and splits <EOU> from <EOB> across the C
boundary (eou_out bitmask, separate JSON eou/eob flags).

Adapt the bindings: the stream JSON gains the eob field, the text path
reads the v5 bitmask (bit0 <EOU>, bit1 <EOB>), and stripEouMarker no
longer reports a backchannel-ended offline decode as an utterance end —
so the retranscribe gate cannot be confirmed by an 'uh-huh'. The live
RPC forwards eob (new proto field) and the realtime turn detector logs
backchannels without arming the commit threshold. Live feeds also log
per-feed decode latency and warn once when the decode falls more than a
second behind real time — the failure mode v4 made chronic.
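
The bit layout (bit0 for <EOU>, bit1 for <EOB>) and the marker handling can be sketched as follows. The function names `decodeFlags` and `stripMarker` are hypothetical, and the assumption that the marker only appears at the end of the text is mine.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	flagEOU uint32 = 1 << 0 // end of utterance
	flagEOB uint32 = 1 << 1 // end of backchannel
)

// decodeFlags splits the bitmask into its two independent signals.
func decodeFlags(mask uint32) (eou, eob bool) {
	return mask&flagEOU != 0, mask&flagEOB != 0
}

// stripMarker removes a trailing literal marker from an offline decode and
// reports which one it was. A backchannel end is not an utterance end.
func stripMarker(text string) (clean string, eou, eob bool) {
	t := strings.TrimSpace(text)
	switch {
	case strings.HasSuffix(t, "<EOU>"):
		return strings.TrimSpace(strings.TrimSuffix(t, "<EOU>")), true, false
	case strings.HasSuffix(t, "<EOB>"):
		return strings.TrimSpace(strings.TrimSuffix(t, "<EOB>")), false, true
	}
	return t, false, false
}

func main() {
	fmt.Println(decodeFlags(2))
	fmt.Println(stripMarker("uh-huh <EOB>"))
}
```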

Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
…le the user speaks

semantic_vad accumulated the live stream's deltas server-side and only
replayed them at commit, so the browser showed nothing until the turn
ended — defeating the point of a live decode. The turn's conversation
item id is now allocated when the turn OPENS; caption deltas stream to
the client under it as they drain, the committed event and the
completed transcript reuse it (so the authoritative text — including a
retranscribe batch correction — replaces the partial instead of
duplicating it), and a turn discarded before commit emits the
transcription failed event so clients retract its captions.

Talk.jsx learns the other half: input transcription delta events upsert
a user entry keyed by item_id (the same idempotent upsert the assistant
side already used), completed replaces it (or appends unkeyed for
server_vad), and failed removes it.

Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
semantic_vad recorded the end-of-utterance token in a boolean, eouSeen,
that observeSegments then cleared on any tick where silero still showed
the speech segment open (End == 0). But the parakeet <EOU> is predictive
— it fires as the user stops, before silero finishes its silence padding
and closes the segment — so eouSeen was cleared on the very tick it was
set, the commit fell through to the eagerness timeout, and turns ended
2-4s late (trigger=timeout in the logs) even though the token had been
observed.

Replace the toggled flag with a recorded fact: eouAtSec, the audio
position of the most recent EOU, set when it drains and never flipped
off mid-turn. Whether it still governs the trailing silence is now a
pure function, eouPending(segments), computed once per tick in handleVAD
and threaded into the threshold, the commit-trigger log, and the
retranscribe gate. An EOU stops applying only when the user STARTS a new
utterance after it (a segment whose start is past the EOU) — genuine
resumed speech — not when silero is merely slow to close the current
one. The retranscribe-reject path consumes the recorded EOU
(eouAtSec = 0) as a deliberate transition rather than a toggle-back.
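
The rule reads naturally as a pure function. This sketch follows the description above; the `segment` type, its field names, and the convention that `End == 0` marks an open segment are taken from the message, but the exact signature is an assumption.

```go
package main

import "fmt"

// segment is a VAD speech segment in seconds; End == 0 means still open.
type segment struct{ Start, End float64 }

// eouPending reports whether the recorded EOU still governs the trailing
// silence. eouAtSec == 0 means no EOU is recorded (or it was consumed).
func eouPending(eouAtSec float64, segments []segment) bool {
	if eouAtSec <= 0 {
		return false
	}
	for _, s := range segments {
		if s.Start > eouAtSec {
			return false // the user started a new utterance after the EOU
		}
	}
	return true // an open or slow-to-close segment does not cancel the EOU
}

func main() {
	open := []segment{{Start: 0.4, End: 0}}
	resumed := []segment{{Start: 0.4, End: 2.1}, {Start: 2.6, End: 0}}
	fmt.Println(eouPending(2.0, open), eouPending(2.0, resumed))
}
```

Since the function only reads its arguments, the same value can be computed once per tick and passed to the threshold, the log line, and the retranscribe gate without any of them disagreeing.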

Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Bash]
With clause chunking, each completed clause was synthesized inline in
the LLM token callback. That callback runs on the goroutine draining the
model's gRPC stream, and emitSpeech blocks until the whole clause is
synthesized (for WebRTC, until it has been played back at real time), so
every clause boundary stalled token generation AND froze the assistant
transcript the client sees — the LLM and TTS ran strictly serially
despite streaming being on.

Move synthesis to a single worker goroutine (ttsPipeline). The token
callback now just sends the transcript delta and hands each clause to
the worker via a non-blocking enqueue, so the recv loop keeps draining
tokens and the transcript keeps streaming while audio is produced
behind it. The clause queue is unbounded — clauses are short strings, a
reply has a bounded number of them, and the costly product (audio) is
paced by the backend regardless — so enqueue never applies backpressure
to generation. One worker preserves clause and audio ordering; both
transports are already safe for the now-concurrent transcript and audio
sends (WS guards SendEvent with a mutex, WebRTC sends events to a
channel and guards RTP writes with rtpMu).

wait() drains and joins the worker, returning the accumulated audio and
the first synthesis error; it is idempotent, so streamLLMResponse drains
it explicitly to read the audio and also defers a wait() as a leak-proof
backstop against future early returns.
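
A single-worker pipeline with an unbounded queue and an idempotent wait can be sketched as below. This is not the PR's `ttsPipeline`: the fields, the `synth` callback, and the choice to keep synthesizing later clauses after an error are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

type ttsPipeline struct {
	mu     sync.Mutex
	cond   *sync.Cond
	queue  []string // unbounded: enqueue never blocks the token callback
	closed bool
	once   sync.Once
	done   chan struct{}
	audio  []byte
	err    error // first synthesis error
}

func newTTSPipeline(synth func(string) ([]byte, error)) *ttsPipeline {
	p := &ttsPipeline{done: make(chan struct{})}
	p.cond = sync.NewCond(&p.mu)
	go p.run(synth)
	return p
}

// enqueue hands a clause to the worker without waiting for synthesis.
func (p *ttsPipeline) enqueue(clause string) {
	p.mu.Lock()
	p.queue = append(p.queue, clause)
	p.mu.Unlock()
	p.cond.Signal()
}

// run is the single worker, which preserves clause and audio ordering.
func (p *ttsPipeline) run(synth func(string) ([]byte, error)) {
	defer close(p.done)
	for {
		p.mu.Lock()
		for len(p.queue) == 0 && !p.closed {
			p.cond.Wait()
		}
		if len(p.queue) == 0 { // closed and fully drained
			p.mu.Unlock()
			return
		}
		clause := p.queue[0]
		p.queue = p.queue[1:]
		p.mu.Unlock()

		chunk, err := synth(clause) // slow work happens outside the lock
		p.mu.Lock()
		if err != nil && p.err == nil {
			p.err = err
		}
		p.audio = append(p.audio, chunk...)
		p.mu.Unlock()
	}
}

// wait drains the queue and joins the worker; calling it again is harmless.
func (p *ttsPipeline) wait() ([]byte, error) {
	p.once.Do(func() {
		p.mu.Lock()
		p.closed = true
		p.mu.Unlock()
		p.cond.Signal()
	})
	<-p.done
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.audio, p.err
}

func main() {
	p := newTTSPipeline(func(c string) ([]byte, error) { return []byte(c), nil })
	defer p.wait() // backstop against early returns
	p.enqueue("Hello, ")
	p.enqueue("world.")
	audio, err := p.wait()
	fmt.Println(string(audio), err)
}
```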

Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Write] [Bash]
The Talk page rendered the pipeline's VAD/Transcription/LLM/TTS models
in a four-column grid. Grid tracks default to min-width:auto, so a long
model name with white-space:nowrap forced its track wider than its 1fr
share and overflowed the container (the inner overflow:hidden could not
shrink the track). Lay them out as a vertical key-value list instead:
each value gets the full row width and wraps (minWidth:0 +
overflowWrap:anywhere) rather than overflowing, with a dash fallback for
an unset component.

Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Bash]