feat(realtime): conversation compaction (summarize-then-drop) + OpenAI item.delete/truncate/clear #10446
Merged
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ai suite Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Add a handler for the input_audio_buffer.clear client event that discards a partially-captured utterance (raw PCM + buffered Opus frames) via a unit-tested clearInputAudio helper, then acks with input_audio_buffer.cleared. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
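A minimal sketch of what such a clear helper could look like. The `inputAudio` struct, its field names, and `handleClear` are assumptions made for illustration; only the helper name `clearInputAudio` and the `input_audio_buffer.cleared` ack come from the commit message.

```go
package main

import "fmt"

// inputAudio is a hypothetical stand-in for the per-session capture state.
// The field names are assumptions, not the PR's actual struct.
type inputAudio struct {
	PCM        []byte   // raw PCM captured for the in-progress utterance
	OpusFrames [][]byte // Opus frames buffered but not yet decoded
}

// clearInputAudio discards a partially captured utterance and reports
// whether anything was actually dropped.
func clearInputAudio(a *inputAudio) bool {
	had := len(a.PCM) > 0 || len(a.OpusFrames) > 0
	a.PCM = a.PCM[:0]  // keep the capacity for the next utterance
	a.OpusFrames = nil // release the buffered frames
	return had
}

// serverEvent mirrors the minimal shape of a realtime server event.
type serverEvent struct {
	Type string `json:"type"`
}

// handleClear resets the buffers and builds the ack event for the client.
func handleClear(a *inputAudio) serverEvent {
	clearInputAudio(a)
	return serverEvent{Type: "input_audio_buffer.cleared"}
}

func main() {
	a := &inputAudio{PCM: []byte{1, 2, 3}, OpusFrames: [][]byte{{0xfc}}}
	ev := handleClear(a)
	fmt.Println(ev.Type, len(a.PCM), len(a.OpusFrames))
}
```

Keeping the reset in a small pure helper is what makes it unit-testable without a live websocket session.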
Clears both .Text and .Transcript of the assistant content part at contentIndex so barge-in truncation also works for audio turns whose spoken words live in .Transcript. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
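The truncation described here can be sketched as follows. The `item` and `contentPart` types are simplified assumptions, and the bounds and role checks are one plausible way to reject bad input without panicking; the PR's real `truncateAssistantText` may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// contentPart and item are simplified, hypothetical versions of the
// realtime conversation types.
type contentPart struct {
	Type       string
	Text       string // text turns
	Transcript string // spoken words of audio turns
}

type item struct {
	ID      string
	Role    string
	Content []contentPart
}

var errBadTruncate = errors.New("truncate: not an assistant item or content index out of range")

// truncateAssistantText clears both Text and Transcript of the content part
// at contentIndex, so barge-in truncation works for text and audio turns.
func truncateAssistantText(it *item, contentIndex int) error {
	if it == nil || it.Role != "assistant" ||
		contentIndex < 0 || contentIndex >= len(it.Content) {
		return errBadTruncate // reject instead of indexing out of range
	}
	it.Content[contentIndex].Text = ""
	it.Content[contentIndex].Transcript = ""
	return nil
}

func main() {
	it := &item{ID: "item_1", Role: "assistant", Content: []contentPart{
		{Type: "audio", Transcript: "Sure, the weather today is"},
	}}
	fmt.Println(truncateAssistantText(it, 0), it.Content[0].Transcript == "")
	fmt.Println(truncateAssistantText(it, 5) != nil)
}
```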
…avoids panic) Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…y, off-path) Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ary stripping Replace the bespoke <think> regex in the compactor with the shared pkg/reasoning extractor (via spokenReasoningConfig), matching the rest of the realtime path and covering all reasoning tag families, not just <think>. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
TestAllFieldsHaveRegistryEntries requires every ModelConfig field to have a UI/meta registry entry; add the four pipeline.compaction.* leaves so they render with proper labels/descriptions instead of the reflection fallback. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
aee1a53 to f6edc05
Summary
Adds server-side conversation compaction to the realtime (voice) API so long sessions stay cheap on CPU without forgetting earlier context. Today a realtime session either feeds the whole growing buffer to the LLM (a latency death spiral on CPU) or, with `max_history_items`, silently drops old turns and forgets them. This change lets the server fold aged-out turns into a rolling summary instead.

Two layers:

1. OpenAI-parity conversation events

The realtime endpoint was missing client-side history management that the OpenAI Realtime API relies on. Now implemented:

- `conversation.item.delete`: was a `not_implemented` stub; now removes the item and emits `conversation.item.deleted`.
- `conversation.item.truncate`: clears an assistant item's text/transcript at a content index (discards an interrupted/barge-in tail).
- `input_audio_buffer.clear`: resets pending input audio.

2. Summarize-then-drop compactor

- `Conversation.Memory` summary, kept out of `Items` so `trimRealtimeItems` can't drop it, injected into the prompt right after the instructions.
- `conv.Lock` is not held across the summarizer LLM call. Commit re-validates the head (`prefixMatches`) so a concurrent `item.delete` can't cause lost or misdropped data. Function-call/output pairs are never split across the boundary.
- …`Conversation` creation site.

Config (two-number model)

`max_history_items` is the live window; `compaction.trigger_items` is the high-water mark and must exceed `max_history_items`. A summary call runs roughly every `(trigger_items - max_history_items)` turns. The summarizer model is resolved lazily, inside the compaction goroutine (off the response path). With `compaction` absent or disabled, behavior is byte-for-byte unchanged.

Testing

- Unit tests for the helpers (`resolveCompaction`, `itemID`, `deleteItem`, `truncateAssistantText`, `clearInputAudio`, `compactionCut`, `withMemory`, `renderItemsTranscript`, `buildSummaryMessages`, `prefixMatches`, `compact`, `summarizerModel`), incl. the commit/abort and summarizer-error paths.
- `go test -race ./core/http/endpoints/openai/...` is clean (the compactor spawns a goroutine).
- `make lint` reports 0 issues on the feature files.
- …`TestOpenAI` suite (single `RunSpecs`).

Notes / follow-ups (out of scope)

- `max_summary_tokens` is advisory (fed to the prompt) in this PR; a hard `Predict`-level cap is a follow-up.
- `summary_model` for CPU loads via the realtime pipeline path and falls back to the pipeline LLM if it can't be constructed.
- …`docs/content/features/openai-realtime.md`.

🤖 Generated with Claude Code
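The snapshot, summarize, re-validate, commit flow described for the compactor can be sketched roughly as below. This is an illustration under assumptions, not the PR's code: the `conversation` and `item` types are simplified, the `summarize` callback stands in for the summarizer LLM call, the `>=` trigger comparison is a guess, and function-call/output pair handling is omitted.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// item and conversation are simplified, hypothetical types.
type item struct {
	ID   string
	Text string
}

type conversation struct {
	Lock   sync.Mutex
	Items  []item
	Memory string // rolling summary, kept outside Items
}

// prefixMatches reports whether items still begins with the snapshot's IDs.
func prefixMatches(items, snapshot []item) bool {
	if len(snapshot) > len(items) {
		return false
	}
	for i := range snapshot {
		if items[i].ID != snapshot[i].ID {
			return false
		}
	}
	return true
}

// compact folds aged-out items into Memory. The lock is released while
// summarize runs, then the head is re-validated before committing.
func compact(c *conversation, maxHistory, trigger int,
	summarize func(memory string, aged []item) (string, error)) (bool, error) {
	if maxHistory < 0 || trigger <= maxHistory {
		return false, nil // invalid or disabled configuration: do nothing
	}
	c.Lock.Lock()
	if len(c.Items) < trigger { // assumed comparison for the high-water mark
		c.Lock.Unlock()
		return false, nil
	}
	cut := len(c.Items) - maxHistory
	aged := append([]item(nil), c.Items[:cut]...) // private snapshot
	memory := c.Memory
	c.Lock.Unlock()

	summary, err := summarize(memory, aged) // slow call, lock not held
	if err != nil {
		return false, err // abort: nothing was dropped
	}

	c.Lock.Lock()
	defer c.Lock.Unlock()
	if !prefixMatches(c.Items, aged) {
		return false, nil // head changed meanwhile (e.g. item.delete): abort
	}
	c.Memory = summary
	c.Items = append([]item(nil), c.Items[len(aged):]...)
	return true, nil
}

func main() {
	c := &conversation{}
	for i := 1; i <= 5; i++ {
		c.Items = append(c.Items, item{ID: fmt.Sprintf("item_%d", i), Text: fmt.Sprintf("turn %d", i)})
	}
	summarize := func(memory string, aged []item) (string, error) {
		parts := []string{}
		if memory != "" {
			parts = append(parts, memory)
		}
		for _, it := range aged {
			parts = append(parts, it.Text)
		}
		return strings.Join(parts, "; "), nil
	}
	ok, err := compact(c, 2, 4, summarize)
	fmt.Println(ok, err, c.Memory, len(c.Items))
}
```

Aborting on a prefix mismatch trades one wasted summarizer call for the guarantee that no item is dropped without being summarized first.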
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
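To make the two-number model concrete, here is a small sketch of the validation and the resulting summary cadence. The `compaction` struct, its fields, and `summaryInterval` are hypothetical names for illustration; the PR's own `resolveCompaction` may behave differently.

```go
package main

import (
	"errors"
	"fmt"
)

// compaction is a hypothetical view of the compaction config block.
type compaction struct {
	Enabled      bool
	TriggerItems int // high-water mark
}

var errTrigger = errors.New("compaction.trigger_items must exceed max_history_items")

// summaryInterval returns roughly how many turns pass between summary calls.
// A result of 0 with a nil error means compaction is absent or disabled.
func summaryInterval(maxHistoryItems int, c *compaction) (int, error) {
	if c == nil || !c.Enabled {
		return 0, nil
	}
	if c.TriggerItems <= maxHistoryItems {
		return 0, errTrigger
	}
	return c.TriggerItems - maxHistoryItems, nil
}

func main() {
	// With a live window of 6 and a high-water mark of 10, a summary
	// call runs roughly every 4 turns.
	n, err := summaryInterval(6, &compaction{Enabled: true, TriggerItems: 10})
	fmt.Println(n, err)
}
```

A wider gap between the two numbers means fewer summarizer calls but a larger prompt just before each compaction.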